An active area of research in the fields of machine learning and statistics is the development of 
causal discovery algorithms, the purpose of which is to infer the causal relations that hold among a 
set of variables from the correlations that these exhibit. We apply some of these algorithms to the 
correlations that arise for entangled quantum systems. We show that they cannot distinguish corre- 
lations that satisfy Bell inequalities from correlations that violate Bell inequalities, and consequently 
that they cannot do justice to the challenges of explaining certain quantum correlations causally. 
Nonetheless, by adapting the conceptual tools of causal inference, we can show that any attempt 
to provide a causal explanation of nonsignalling correlations that violate a Bell inequality must 
contradict a core principle of these algorithms, namely, that an observed statistical independence 
between variables should not be explained by fine-tuning of the causal parameters. We demonstrate 
the need for such fine-tuning for most of the causal mechanisms that have been proposed to underlie 
Bell correlations, including superluminal causal influences, superdeterminism (that is, a denial of 
freedom of choice of settings), and retrocausal influences which do not introduce causal cycles. 



I. INTRODUCTION 

A causal relation, unlike a correlation, is an asymmetric
relation that can support inferences about the consequences
of interventions and about counterfactuals. The
sun rising and the rooster crowing are strongly correlated, 
but to say that the first is the cause of the second is to 
say more. In particular, it says that forcing the rooster 
to crow early will not precipitate an early dawn, whereas 
causing the sun to rise early (for instance, by moving 
the rooster eastward), can lead to some early crowing. 
Nonetheless, causal structure has implications for the ob- 
served correlations and consequently one can make infer- 
ences about the causal structure based on the observed 
correlations. Indeed, there has been much progress in the 
last twenty-five years on how to make such inferences, 
progress that has been primarily due to philosophers and 
researchers in the field of machine learning and which is 
well summarized in the books of Pearl [1] and of Spirtes, 
Glymour and Scheines (SGS) [2]. Such inference schemes 
are known as causal discovery algorithms. In this article, 
we shall consider the question of what some prominent 
causal discovery algorithms have to say about the causal 
structure that might underlie quantum correlations, in 
particular those that violate Bell inequalities. 

Suppose that one conducts measurements on a pair of 
systems that have been prepared together and then re- 
moved to distant locations such that the outcome at each 
wing of the experiment is outside the future light cone 
of the measurement choice in the other wing. Suppose 
further that one finds that the correlations so obtained 
violate Bell inequalities. If one insists on a causal expla- 
nation of these correlations, then it would seem that one 
must admit that the causes must propagate faster than 
the speed of light. But this is in tension with the fact that 
one cannot send signals faster than the speed of light. We
take this tension to be the mystery of Bell's theorem: if
there are indeed superluminal causes, then why can't we 
use them to send superluminal signals? In this article, 
we will show that the principles behind causal discovery 
algorithms can clarify the nature of this tension. We also 
show that this tension persists in more exotic proposals
for causal explanations of Bell inequality violations
such as superdeterminism, which is an assumption that 
one is not free to choose the measurement setting, and 
retrocausation, wherein causes propagate counter to the 
standard direction of time. 

Our analysis will also reveal some significant inadequa- 
cies of certain existing causal discovery algorithms when 
applied to Bell experiments and therefore we believe that 
some of the expertise developed in the field of quantum 
foundations on causal explanations of correlations may 
lead to improvements in these algorithms. 

The distinction between causal and inferential con- 
cepts is an instance of the distinction between ontic con- 
cepts (those pertaining to reality) and epistemic concepts 
(those pertaining to our knowledge of reality). Within 
the field of statistics, disentangling causal and inferential 
concepts is notoriously difficult and controversial, as is 
the question of when causal claims are supported by the 
observed correlations. In the quantum realm, where there 
is even less agreement about which parts of the formal- 
ism refer to ontic concepts and which refer to epistemic 
concepts, the problem is compounded [3]. As such, we 
shall try to present our analysis in a manner that does 
not presume any particular interpretation of quantum 
theory. For instance, given that different interpretations 
disagree on whether quantum theory implies an objective 
indeterminism in nature or not, we shall not presume any 
particular answer to this question. Instead, we simply focus
on the operational predictions of the theory.

The algorithms we consider take as their input the set 
of conditional independences that hold in a probability 
distribution over observed variables; no other feature of 
the probability distribution is relevant for them. In addi- 
tion, we consider only algorithms that look at statistical 
independences; those that use algorithmic independences 
are not considered [4]. 

Some previous work has already considered Bell's the- 
orem from the perspective of causal discovery algorithms. 
In particular, the books by Pearl [1] and by SGS [2] comment
briefly on the question. They both assert that Bell's
theorem forces a dilemma between abandoning a particular
notion of locality, that there are no superluminal
causal influences, and abandoning what we will call Re- 
ichenbach's principle, which is the assumption that cor- 
relations are to be explained either by direct causation 
or a common cause. One can legitimately quibble with 
this conclusion on the grounds that there are other as- 
sumptions that go into Bell's theorem: the freedom in 
the choice of settings and the absence of retrocausal in- 
fluences, for instance. Nonetheless, we feel that this is an 
improvement over the standard characterization of Bell's 
theorem as forcing a dilemma between abandoning local- 
ity and abandoning realism. It has always been rather 
unclear what precisely is meant by "realism". Norsen 
has considered various philosophical notions of realism 
and concluded that none seem to have the feature that 
one could hope to save locality by abandoning them [5]. 
For instance, if realism is taken to be a commitment to 
the existence of an external world, then the notion of 
locality - that every causal influence between physical 
systems propagates subluminally - already presupposes 
realism. 

Where previous work in causal discovery has made 
claims about what assumption it is best to give up in 
the face of Bell inequality violations, it has fallen on the 
side of abandoning Reichenbach's principle. 1 We will 
take a different tack in the present work. Reichenbach's 
principle will not be questioned. In other words, we will 
hold fast to the notion that all correlations - includ- 
ing those predicted by quantum theory - need to be ex- 
plained causally, and we will explore what insights may 
be gained from causal discovery algorithms under this 
assumption. One reason to proceed in this manner is 
that the idea of explaining correlations causally appears 
to us to be central to the scientific enterprise. Indeed, 
causal hypotheses are indispensable when applying scientific
theories to pragmatic ends because, unlike correlations,
they support inferences about the consequences
of actions.

1 For instance, in Ref. [6], Glymour argues that there is not even
a dilemma, that one must abandon Reichenbach's principle. He
argues for this on the grounds that a superluminal causal influence
would imply superluminal signalling, and the latter is not observed
experimentally. However, this argument is incorrect because there
are causal models that posit superluminal causal influences but
which do not lead to superluminal signals, for instance the
interpretation of quantum theory wherein the wavefunction is a complete
description of reality. In such models, the Markov condition can
be salvaged.

In any case, our main conclusions have not been high- 
lighted in the previous literature on Bell's theorem and 
causal discovery algorithms. 

Our first conclusion is a relatively straightforward one. 
We note that in the case of quantum correlations for a 
pair of correlated systems, all correlations exhibit the 
following conditional independence relations among the 
observable variables: 

1. Marginal independence of the setting variables, 

2. No-signalling, that is, conditional independence of 
the outcome at one wing of the experiment from 
the setting at the opposite wing given the setting 
at the first wing, 

and for all but a set of measure zero of experimental 
scenarios, these are the only independences. These inde- 
pendences characterize both the correlations that satisfy 
all the Bell inequalities, and the correlations that violate 
some Bell inequality. Therefore, if the causal discovery 
algorithm takes as its input not the full distribution but 
only the conditional independence relations that hold in 
the distribution (as is the case with the prominent such 
algorithms), then this algorithm cannot distinguish corre- 
lations that violate Bell inequalities from correlations that 
satisfy Bell inequalities. The input to such algorithms is 
simply too impoverished to see the difference. It follows 
that the causal distinctions that do exist between these 
correlations — those that are implied by Bell's theorem 
— cannot be recognized by these algorithms. They will 
consequently make incorrect assessments of what causal 
structure is implied by a given set of correlations. 
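
The point that conditional independences alone cannot see a Bell violation can be illustrated numerically. The following is a minimal sketch under stated assumptions: the one-parameter family `bell_distribution(c)` with correlators c(-1)^{st} and uniform, independent settings is an illustrative choice of ours, not a construction taken from the text, and all function names are hypothetical. Because this family has uniform outcome marginals, it also exhibits some accidental independences of the measure-zero kind mentioned above; these are the same for every nonzero c, so the comparison is unaffected.

```python
from itertools import combinations, product
from math import sqrt

VARS = ("A", "B", "S", "T")  # outcomes A, B and settings S, T, all binary


def bell_distribution(c):
    """Joint P(A,B,S,T): uniform independent settings, correlators c*(-1)**(s*t)."""
    joint = {}
    for a, b, s, t in product((0, 1), repeat=4):
        e = c * (-1) ** (s * t)
        joint[(a, b, s, t)] = 0.25 * (1 + (-1) ** (a + b) * e) / 4
    return joint


def marginal(joint, names):
    """Marginal distribution over the named variables."""
    idx = [VARS.index(n) for n in names]
    out = {}
    for values, p in joint.items():
        key = tuple(values[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


def is_ci(joint, x, y, z, tol=1e-9):
    """Test (x independent of y given z) via P(x,y,z) P(z) = P(x,z) P(y,z)."""
    pxyz = marginal(joint, (x, y) + z)
    pxz, pyz, pz = marginal(joint, (x,) + z), marginal(joint, (y,) + z), marginal(joint, z)
    return all(
        abs(p * pz[k[2:]] - pxz[(k[0],) + k[2:]] * pyz[(k[1],) + k[2:]]) <= tol
        for k, p in pxyz.items()
    )


def ci_pattern(joint):
    """Truth value of every CI statement between two single variables."""
    pattern = {}
    for x, y in combinations(VARS, 2):
        rest = [v for v in VARS if v not in (x, y)]
        for r in range(len(rest) + 1):
            for z in combinations(rest, r):
                pattern[(x, y, z)] = is_ci(joint, x, y, z)
    return pattern


def chsh(joint):
    """CHSH expression E_00 + E_01 + E_10 - E_11."""
    def correlator(s, t):
        cells = [(a, b) for a in (0, 1) for b in (0, 1)]
        p_st = sum(joint[(a, b, s, t)] for a, b in cells)
        return sum((-1) ** (a + b) * joint[(a, b, s, t)] for a, b in cells) / p_st
    return correlator(0, 0) + correlator(0, 1) + correlator(1, 0) - correlator(1, 1)


inside_bound = bell_distribution(0.4)             # CHSH value 4 * 0.4 = 1.6
beyond_bound = bell_distribution(1 / sqrt(2))     # CHSH value 2 * sqrt(2)
same_ci_pattern = ci_pattern(inside_bound) == ci_pattern(beyond_bound)
```

Here `same_ci_pattern` compares the full list of CI statements for the two distributions, while `chsh` separates them: an algorithm that receives only the output of `ci_pattern` has no access to the quantity that `chsh` computes.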

It is nonetheless interesting to see what the algorithms 
return as possible causal structures for no-signalling cor- 
relations. We look at both the case where one presumes 
that the settings and outcomes are the only causally rel- 
evant variables, i.e., the case of no hidden variables, and 
the case where one imagines that hidden variables may be 
causally relevant. Our main conclusion is that any causal 
model that can hope to explain Bell-inequality-violating 
correlations (or EPR correlations without recourse to 
hidden variables) has the feature that in order to ex- 
plain the statistical independencies among the observed 
variables, in particular the no-signalling constraints, the 
model must involve a fine-tuning of the causal parame- 
ters, thereby violating a core principle of the best causal 
discovery algorithms. 

So, in the end, we obtain a characterization of Bell's 
theorem that is quite far from its standard characterization
as a denial of "local realism". The nebulous assumption
of "realism" is replaced with Reichenbach's principle
that correlations should be explained causally. To get a 
contradiction, it is sufficient to supplement Reichenbach's 
principle with an assumption that is rather different from 
Bell's notion of local causality, namely, the assumption 
that the causal parameters in the model are not fine- 
tuned. As we shall see, the latter assumption and the 
fact that there are no superluminal signals together imply 
the lack of superluminal causal influences, which is Bell's 
notion of local causality. Another advantage of this char- 
acterization of Bell's theorem is that the assumptions of 
Reichenbach's principle and no fine-tuning also rule out 
superdeterminism and certain kinds of retrocausal influences,
so that these no longer constitute reasonable ways
of avoiding the contradiction. 



II. CAUSAL STRUCTURES AND CAUSAL MODELS

The modern approach to the formal study of causality
considers in some detail the significance of interventions
and counterfactuals for defining the notion of a causal
relation [1, 2]. There is a large literature on whether these
sorts of definitions are adequate [7]. Although questions
of this sort are relevant to a discussion of Bell's theorem,
they will not be the focus of this article. We begin by
describing the mathematical formalism that is relevant
for describing the causal discovery algorithms in Refs. [1]
and [2]. We follow the presentation of these authors.

A causal structure is a set of variables V and a set of
ordered pairs of distinct variables (X, Y) specifying that
X is a direct cause of Y relative to V.

Being in a relationship of direct causation is a property
that is defined relative to the set of variables being
considered. If one considers a larger set which includes
more variables, then what was a direct causal relation in
the first set might become a mediated causal relation in
the second.

Such causal structures can be represented conveniently
by directed acyclic graphs (DAGs). A directed graph
G corresponds to a set of vertices and a set of directed
edges among the vertices (a vertex cannot be connected
to itself). The acyclic property asserts that there are no
directed paths in the graph that begin and end at the
same vertex. DAGs represent causal structures in the
obvious manner: every variable in V is represented by a
vertex, and for every pair of variables (X, Y) where X is
a direct cause of Y, there is a directed edge in the graph
between the associated vertices. 2

As is standard, we use the terminology of family relations
in the obvious manner: if X is a cause of Y, direct
or mediated, then X is said to be an ancestor of Y, and
Y is said to be a descendant of X. If X is a direct cause
of Y, then X is said to be a parent of Y. The variables
in the causal structure that have no parents will be
called exogenous, while those with parents will be called
endogenous.

A deterministic causal model consists of a causal
structure and a set O of causal and probabilistic parameters.
The causal parameters describe the functional relations
that fix the values of every variable X given its
parents Pa(X) in the causal structure, that is, for every
X they describe a function f specifying X = f(Pa(X)).
The probabilistic parameters specify a probability distribution
over the exogenous variables, that is, a distribution
P(X) for every exogenous X.

An example of a deterministic causal model is given in
Fig. 1.




Model parameters: $P(S)$, $P(T)$, $P(U)$, $P(V)$, $P(W)$,
$X = f_X(S, T, U, W)$, $Y = f_Y(T, V, W)$.

FIG. 1: An example of a deterministic causal model.

The notion of a general causal model can be explained
as follows. We start with a deterministic causal model
and modify it in a particular way. When an exogenous
variable U is the parent of only a single other variable,
say X (i.e. it is not a common cause of two or
more variables), it is possible to eliminate U from the
causal structure, and to replace the deterministic dependence
of X on its original set of parents with a probabilistic
dependence on its new set of parents. Specifically,
if the deterministic causal model specifies that
X = f(Pa(X)) for some function f (here Pa(X) includes
U) then the new causal model specifies a conditional
probability P(X|Pa'(X)) (here Pa'(X) are the
parents relative to the new causal structure, which excludes
U). Explicitly, the conditional probability is defined by

$$P(X \mid \mathrm{Pa}'(X)) = \sum_{U} \delta_{X,\, f(\mathrm{Pa}'(X),\, U)}\, P(U).$$
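
The elimination of an exogenous variable described above can be sketched in a few lines. This is only an illustration under our own assumptions: the helper name `eliminate_exogenous`, the XOR mechanism and the noise probability 0.1 are hypothetical and do not come from the text.

```python
def eliminate_exogenous(f, p_u, parent_values, x_values):
    """Return P(X | Pa'(X)) = sum_U delta(X, f(Pa'(X), U)) P(U) as nested dicts.

    f             -- deterministic mechanism, called as f(pa, u)
    p_u           -- dict mapping each value u of U to P(U = u)
    parent_values -- iterable of tuples, the joint values of the remaining parents
    x_values      -- the possible values of X
    """
    table = {}
    for pa in parent_values:
        dist = {x: 0.0 for x in x_values}
        for u, prob in p_u.items():
            # The Kronecker delta puts all the weight P(U = u) on x = f(pa, u).
            dist[f(pa, u)] += prob
        table[pa] = dist
    return table


# Hypothetical example: X = S XOR U, where U flips the value with probability 0.1.
def xor_mechanism(pa, u):
    return pa[0] ^ u


noisy_copy = eliminate_exogenous(xor_mechanism, {0: 0.9, 1: 0.1}, [(0,), (1,)], (0, 1))
```

After the elimination, X depends on S only probabilistically: `noisy_copy[(s,)]` is the distribution of X for each value s of the remaining parent.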

A general causal model consists of a causal structure
and a set O of causal-statistical parameters. The
causal-statistical parameters specify a conditional probability
distribution for every variable given its causal parents,
P(X|Pa(X)). Exogenous variables have the null set for
their causal parents, so that to condition on their parents
is not to condition at all. Consequently, the
causal-statistical parameters specify the distributions over the
exogenous variables. 3

An example of a general causal model is given in Fig. 2.
It can be obtained from the deterministic causal model of
Fig. 1 by eliminating the exogenous variables U and V.
(Note that one need not eliminate all exogenous variables
from a deterministic causal model to obtain a nondeterministic
causal model; for instance, S and T have not
been eliminated in our example.)

2 One can imagine more general notions of causation wherein directed
cycles are allowed, but we will not consider such notions here.

3 Such models are sometimes called Markovian. A more general
sort of model, which allows bi-directed edges representing the
existence of a common cause for a pair of variables, is called
semi-Markovian.




FIG. 2: An example of a causal model consisting of a
causal structure, represented by a directed acyclic graph,
and a set of causal-statistical parameters, specifying the
probability of each variable conditioned on its parents.

Deterministic causal models are clearly a special case 
of causal models where all conditional probabilities cor- 
respond to deterministic functions. It is also clear that 
for any given causal model, one can always view it as 
arising from a deterministic causal model by excluding 
some exogenous variables. To obtain such a determin- 
istic extension of a causal model, it suffices to add new 
exogenous variables as parents of every endogenous vari- 
able in the model. For the rest of the article, we will focus 
on the general notion of a causal model, rather than on 
deterministic causal models. 

We pause to discuss briefly the possible interpretation 
of the probabilities in a causal model. One could take 
a Bayesian attitude towards the probabilities appearing 
in a causal model. In this case, the marginal proba- 
bility on an exogenous variable U represents an agent's 
degrees of belief about U, and the conditional probability
P(X|Pa(X)) represents degrees of belief about X
given its parents. Another possibility is to take a fre- 
quentist attitude towards the probabilities. This is ar- 
guably the position adopted by Pearl, who describes the 
auxiliary variables appearing in a deterministic extension 
of a causal model as 'unmeasurable conditions that Na- 
ture governs by some undisclosed probability function' 
([1], p. 44). One could even interpret the probabilities 
as propensities, indicating an irreducible randomness in 
one's theory (an option that might be appealing to some 
when considering the possibility of explaining quantum 
correlations in terms of causal models). Our conclusions 
here will be independent of this choice. 4

4 Although we ultimately favor the Bayesian interpretation.

It is worth noting that the definition of a causal model
implies that exogenous variables should be independently
distributed. The idea is that we are trying to explain all
correlations by a causal mechanism, so that one should
include in the model sufficiently many variables that any
correlation between two variables can be deduced from
the causal structure. The exogenous variables are, by
definition, the variables that one takes to be uncorrelated.
In this sense, the definition of a causal model incorporates
Reichenbach's principle: if two variables are correlated
then either one is the cause of the other, or there is a
common cause. This is not an exclusive or - it could
be that two variables have both a common cause and a
direct causal relation between them.

Consider the following question: given a causal model, 
what sorts of correlations can be observed among the 
variables? Clearly, there is a set of joint distributions 
that are possible, depending on the causal-statistical pa- 
rameters that we add to the causal structure to get a 
causal model. 

Consider the example from Fig. 2. It is clear that the
causal model predicts that the joint distribution over all
the variables should be

$$P(X, Y, S, T, W) = P(W)\,P(S)\,P(T)\,P(Y \mid T, W)\,P(X \mid Y, S, T, W). \tag{2.1}$$

In general, a causal model with variables $V = \{X_1, \ldots, X_n\}$
predicts a joint distribution of the form

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i)). \tag{2.2}$$

Essentially, one multiplies together the conditional prob- 
abilities for every variable given its parents, all of which 
are specified by the causal model. For a DAG that is 
not a complete graph (i.e., not every pair of nodes is connected
by an edge), the probability distributions that it
supports are a subset of the possible distributions over 
those variables. 
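
The product rule of Eq. (2.2) translates directly into code. The sketch below is illustrative only: the function name `joint_distribution`, the dictionary-based encoding of the model and the two-variable example are assumptions of ours rather than anything specified in the text.

```python
from itertools import product


def joint_distribution(variables, parents, cpt, values=(0, 1)):
    """Build the joint distribution as the product of P(X_i | Pa(X_i)).

    variables -- tuple of variable names
    parents   -- dict: variable -> tuple of its parents (empty for exogenous ones)
    cpt       -- dict: variable -> function (x, parent_values) -> P(X = x | parents)
    """
    joint = {}
    for assignment in product(values, repeat=len(variables)):
        val = dict(zip(variables, assignment))
        p = 1.0
        for v in variables:
            pa = tuple(val[q] for q in parents[v])
            p *= cpt[v](val[v], pa)
        joint[assignment] = p
    return joint


# Hypothetical two-variable model A -> B: B copies A with probability 0.8.
example = joint_distribution(
    ("A", "B"),
    {"A": (), "B": ("A",)},
    {
        "A": lambda x, pa: 0.3 if x == 1 else 0.7,
        "B": lambda x, pa: 0.8 if x == pa[0] else 0.2,
    },
)
```

Each entry of `example` is a product of one factor per variable, so the result is normalised whenever each conditional table is.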

We now turn to another question: what properties do 
all distributions consistent with a given causal structure 
have in common? In other words, what are the features 
of the joint probability distribution that depend only on 
the causal structure and not the causal-statistical param- 
eters? Conditional independence (CI) relations are an 
example of such properties, and they are the ones that 
most causal discovery algorithms focus upon. 

Recall that variables X and Y are conditionally independent
given Z, denoted

$$(X \perp Y \mid Z),$$

if any of the following three equivalent conditions hold:

1. $P(X \mid Y, Z) = P(X \mid Z) \quad \forall\, P(Y, Z) > 0$,

2. $P(Y \mid X, Z) = P(Y \mid Z) \quad \forall\, P(X, Z) > 0$,

3. $P(X, Y \mid Z) = P(X \mid Z)\,P(Y \mid Z) \quad \forall\, P(Z) > 0$.

An intuitive account of each of these conditions is as follows:
In the context of already knowing Z, (1) learning Y
teaches you nothing about X (i.e. Y teaches you nothing
more about X than what you already could infer from
knowing Z), (2) learning X teaches you nothing about Y,
and (3) X and Y are uncorrelated. Note that marginal
independence of X and Y, where P(X,Y) = P(X)P(Y),
is simply conditional independence where the conditioning
set is the null set.
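
As a small numerical check of the equivalence of these conditions, the following sketch evaluates conditions 1 and 3 on two hand-built distributions over binary (X, Y, Z). Everything here is an assumption made for illustration: the helper names, the common-cause example `fork` and the counterexample `copy` are ours.

```python
from itertools import product


def prob(joint, **fixed):
    """Probability that the named variables (x, y, z) take the given values."""
    return sum(
        p for (x, y, z), p in joint.items()
        if all({"x": x, "y": y, "z": z}[k] == v for k, v in fixed.items())
    )


def condition_1(joint, tol=1e-12):
    """P(X|Y,Z) = P(X|Z) wherever P(Y,Z) > 0."""
    for x, y, z in product((0, 1), repeat=3):
        pyz = prob(joint, y=y, z=z)
        if pyz > 0:
            lhs = prob(joint, x=x, y=y, z=z) / pyz
            rhs = prob(joint, x=x, z=z) / prob(joint, z=z)
            if abs(lhs - rhs) > tol:
                return False
    return True


def condition_3(joint, tol=1e-12):
    """P(X,Y|Z) = P(X|Z) P(Y|Z) wherever P(Z) > 0."""
    for x, y, z in product((0, 1), repeat=3):
        pz = prob(joint, z=z)
        if pz > 0:
            lhs = prob(joint, x=x, y=y, z=z) / pz
            rhs = prob(joint, x=x, z=z) * prob(joint, y=y, z=z) / pz ** 2
            if abs(lhs - rhs) > tol:
                return False
    return True


# Hypothetical common-cause model: Z influences X and Y separately.
fork = {}
for x, y, z in product((0, 1), repeat=3):
    px = (0.2, 0.7)[z] if x else 1 - (0.2, 0.7)[z]
    py = (0.6, 0.1)[z] if y else 1 - (0.6, 0.1)[z]
    fork[(x, y, z)] = 0.5 * px * py

# Hypothetical counterexample: Y is a copy of X, and Z is an unrelated fair coin.
copy = {(x, x, z): 0.25 for x, z in product((0, 1), repeat=2)}
```

Both tests agree on each distribution: they accept `fork`, where Z screens off X from Y, and reject `copy`, where knowing Z does nothing to remove the dependence.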

The definition of conditional independence implies that 
certain logical inferences hold among CI relations. In 
other words, a set of CI relations need not be logically 
independent. In particular, the semi-graphoid axioms 
specify some inferences that can be drawn among CI re- 
lations. They are: 

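Symmetry: $(X \perp Y \mid Z) \Leftrightarrow (Y \perp X \mid Z)$
Decomposition: $(X \perp YW \mid Z) \Rightarrow (X \perp Y \mid Z)$
Weak Union: $(X \perp YW \mid Z) \Rightarrow (X \perp Y \mid ZW)$
Contraction: $(X \perp Y \mid Z)$ and $(X \perp W \mid ZY) \Rightarrow (X \perp YW \mid Z)$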

Any set of variables can be considered as a new variable, 
so each of the variables X,Y, W and Z appearing in the 
axioms should be understood as possibly representing a 
set of variables. These axioms are quite intuitive.
Decomposition, for instance, states that if, in the context of
knowing Z, learning Y and W teaches you nothing about
X, then learning Y alone teaches you nothing about X.

Note that if one wants to specify all the CI relations 
that hold for a given probability distribution, it suffices 
to specify a generating set, defined to be a set from which 
the rest can be obtained by the semi-graphoid axioms. In 
this paper, the conditional independence relations will 
typically be specified by a generating set. 

With these tools in hand, we can now discuss the cen- 
tral result concerning what properties of a joint probabil- 
ity distribution can be inferred from the causal structure. 

Theorem 1 (Causal Markov condition) In the joint 
distribution induced by a causal structure, every vari- 
able X is conditionally independent of its nondescendants 
given its parents, 

$$(X \perp \mathrm{Nd}(X) \mid \mathrm{Pa}(X)).$$



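This result follows from Eq. (2.2) because

$$\begin{aligned}
P(X \mid \mathrm{Pa}(X), \mathrm{Nd}(X))
&= \frac{P(X, \mathrm{Pa}(X), \mathrm{Nd}(X))}{P(\mathrm{Pa}(X), \mathrm{Nd}(X))} \\
&= \frac{P(X \mid \mathrm{Pa}(X)) \prod_{Y \in \mathrm{Pa}(X), \mathrm{Nd}(X)} P(Y \mid \mathrm{Pa}(Y))}{\prod_{Y \in \mathrm{Pa}(X), \mathrm{Nd}(X)} P(Y \mid \mathrm{Pa}(Y))} \\
&= P(X \mid \mathrm{Pa}(X)).
\end{aligned} \tag{2.3}$$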



The causal Markov condition implies a CI relation for 
every variable that is not exogenous in the causal struc- 
ture. One can then infer additional CI relations from 
these by the semi-graphoid axioms. 

To see these ideas in action, consider again the example
from Fig. 2. It turns out that $(Y \perp S \mid T)$ for this causal
structure, as we now demonstrate. Applying the causal
Markov condition to Y, one infers that $(Y \perp XS \mid WT)$.
Applying it to W, S and T one infers $(W \perp ST)$,
$(S \perp WT)$ and $(T \perp WS)$ respectively. By the decomposition
axiom, $(Y \perp XS \mid WT)$ implies $(Y \perp S \mid WT)$. From the
contraction axiom, $(Y \perp S \mid WT)$ and $(S \perp WT)$ imply
$(S \perp YWT)$. Finally, from weak union we obtain
$(S \perp YW \mid T)$ and then from decomposition again we have
$(S \perp Y \mid T)$, which is equivalent by symmetry to $(Y \perp S \mid T)$.

We see that it can be rather laborious to infer CI relations
from the causal Markov condition and the
semi-graphoid axioms. Fortunately, there is a graphical criterion
for identifying such relations, known as d-separation
[1]. We will not dwell on this notion here, but we present
a brief introduction in App. A.
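
For readers who want to experiment, d-separation can be tested mechanically. The sketch below uses the standard moralised-ancestral-graph criterion, which is one of several equivalent formulations and is not necessarily the one presented in the appendix. The function names are hypothetical, and the example DAG (X with parents S, T, W and Y with parents T, W) is an assumption read off from the parameter list of Fig. 1 after dropping U and V.

```python
def ancestors(parents, nodes):
    """Return the given nodes together with all of their ancestors."""
    seen, stack = set(), list(nodes)
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents.get(v, ()))
    return seen


def d_separated(parents, xs, ys, zs):
    """True if xs and ys are d-separated given zs in the DAG `parents`.

    parents -- dict: node -> tuple of parent nodes (nodes without parents may be omitted)
    """
    keep = ancestors(parents, set(xs) | set(ys) | set(zs))
    # Moralise the ancestral subgraph: connect every node to its parents,
    # connect parents of a common child, and forget edge directions.
    adj = {v: set() for v in keep}
    for v in keep:
        pa = list(parents.get(v, ()))
        for p in pa:
            adj[v].add(p)
            adj[p].add(v)
        for i, p in enumerate(pa):
            for q in pa[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    # xs and ys are d-separated iff every undirected path between them hits zs.
    blocked = set(zs)
    seen, stack = set(), [x for x in xs if x not in blocked]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        if v in ys:
            return False
        stack.extend(n for n in adj[v] if n not in blocked and n not in seen)
    return True


# Assumed structure for the running example (see the lead-in above).
example_dag = {"X": ("S", "T", "W"), "Y": ("T", "W")}
y_sep_s_given_t = d_separated(example_dag, {"Y"}, {"S"}, {"T"})
```

Under the assumed structure, `y_sep_s_given_t` reproduces the relation derived above from the semi-graphoid axioms without any of the intermediate steps.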

Note that in addition to the CI relations that are 
implied by the causal structure, the particular causal- 
statistical parameters may imply other such relations. 
Such additional CI relations are problematic for causal 
discovery algorithms, as we shall see. 



III. CAUSAL DISCOVERY ALGORITHMS 

We have described the correlations that are possi- 
ble for a given causal structure. Causal discovery al- 
gorithms seek to solve the inverse problem: starting 
from correlations among observed variables, can one infer 
which causal structures might account for these correla- 
tions? Researchers in this area have indeed devised some 
schemes for narrowing down the set of causal structures 
that can yield a natural explanation of the correlations, 
wherein the notion of naturalness at play is one that we 
shall make explicit shortly. The algorithms look to the 
conditional independences among the variables to infer 
information about the causal structure. 

In general, causal discovery algorithms may be applied 
directly to experimental data and in this case one needs 
to deal with the subtle issue of how to infer conditional 
independence relations from a finite sample of a proba- 
bility distribution. However, we are here going to apply 
the causal discovery algorithms directly to the distribu- 
tions prescribed by quantum theory, so we needn't worry 
about this subtlety. 

It is worth reviewing a few basic facts about the output 
of causal discovery algorithms. First of all, two differ- 
ent causal structures might support precisely the same 
probability distributions, so that observation of one of 
these distributions necessarily leaves one ignorant about 
which causal structure is at play. As an example, for 
three variables, the three causal structures shown in Fig. 3
all support the same set of probability distributions - 
those wherein A and B are conditionally independent 
given C (these are the DAGs wherein A and B are d- 
separated given C). (The general conditions under which 
two causal structures are observationally equivalent are
given by theorem 1.2.8 in Ref. [1].)

FIG. 3: The three causal models consistent with the CI
relation $(A \perp B \mid C)$.

It follows that causal discovery algorithms will necessarily
sometimes yield an equivalence class of causal
structures. When this occurs, additional information is 
required if one is to narrow the causal structure down to 
a unique possibility, for instance information about the 
temporal order of some of the variables. 

Despite this, one can often narrow down the field of 
causal possibilities significantly. To get a feeling for how 
this works, it is useful to start with a very simple example.
Suppose that one has three binary-valued variables,
denoted A, B and C. Suppose further that the joint
distribution over the triple, P(A, B, C), is such that

$$\begin{aligned}
(A \perp B) \quad &\text{i.e.} \quad P(A,B) = P(A)P(B), \\
(A \not\perp C) \quad &\text{i.e.} \quad P(A,C) \neq P(A)P(C), \\
(B \not\perp C) \quad &\text{i.e.} \quad P(B,C) \neq P(B)P(C).
\end{aligned} \tag{3.1}$$



What is the natural causal explanation for this sort of 
correlation? It is as shown in Fig. 4. The marginal independence
of A and B is explained by their being causally
independent.



FIG. 4: The natural causal model for the set of CI
relations given in Eq. (3.1).



However, there are other possible causal explanations, 
such as the one given in Fig. 5. This is a possible
explanation because there are two causal mechanisms
by which A and B could become correlated, and
it could be that the two types of correlations combine in 
such a way as to leave A and B marginally independent. 
For this to happen, however, the parameters in the causal 
model cannot be chosen arbitrarily and it is in this sense 
that the explanation is less natural than the one provided 
by Fig. 4.




FIG. 5: An unnatural causal model for the set of CI
relations given in Eq. (3.1).



We adopt the following notational convention (inspired
by the representation of mixtures in quantum theory):

$P(A) = [x]$ means $P(A = x) = 1$,
$P(A, B) = [x][y] = [xy]$ means $P(A = x, B = y) = 1$.



Consider the following joint distribution, which has the
dependences described in Eq. (3.1),

$$P(A,B,C) = \tfrac{1}{4}[000] + \tfrac{1}{4}[010] + \tfrac{1}{4}[100] + \tfrac{1}{4}[111]. \tag{3.2}$$

We can easily verify that

$$P(A,B) = \left(\tfrac{1}{2}[0] + \tfrac{1}{2}[1]\right)\left(\tfrac{1}{2}[0] + \tfrac{1}{2}[1]\right),$$

so that A and B are indeed marginally independent. We
also have

$$P(A,C) = P(B,C) = \tfrac{1}{2}[00] + \tfrac{1}{4}[10] + \tfrac{1}{4}[11],$$

so that A and C are marginally dependent, as are B and
C.

The natural explanation is achieved by assuming that
the causal structure is as given in Fig. 4 and the priors
over the exogenous variables and the conditional probabilities
for the endogenous variables are as follows:

$$\begin{aligned}
P(A) &= \tfrac{1}{2}[0] + \tfrac{1}{2}[1], \\
P(B) &= \tfrac{1}{2}[0] + \tfrac{1}{2}[1], \\
P(C \mid A, B) &= [A \cdot B],
\end{aligned}$$

where $A \cdot B$ denotes the product of the values of A and
B. Thus in this causal model, A and B are each chosen
uniformly at random, and C is obtained as their product
(equivalently, the logical AND of A and B). One can
easily verify that P(A)P(B)P(C|A,B) yields the distribution
of Eq. (3.2).
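
The verification is short enough to carry out mechanically. The following sketch builds the joint distribution from uniform A and B with C equal to their product, using exact fractions, and then checks the pattern of Eq. (3.1); the helper names `marginal` and `independent` are our own.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)

# Natural model: A and B uniform and independent, C = A * B.
# Tuples are ordered (a, b, c).
natural_joint = {}
for a, b in product((0, 1), repeat=2):
    natural_joint[(a, b, a * b)] = half * half


def marginal(joint, idx):
    """Marginal over the positions listed in idx (0 = A, 1 = B, 2 = C)."""
    out = {}
    for values, p in joint.items():
        key = tuple(values[i] for i in idx)
        out[key] = out.get(key, 0) + p
    return out


def independent(joint, i, j):
    """Exact test of marginal independence between positions i and j."""
    pij = marginal(joint, (i, j))
    pi, pj = marginal(joint, (i,)), marginal(joint, (j,))
    return all(pij.get((u, v), 0) == pi[(u,)] * pj[(v,)] for (u,) in pi for (v,) in pj)
```

With these definitions, `independent(natural_joint, 0, 1)` holds while the corresponding tests for the pairs (A, C) and (B, C) fail, matching Eq. (3.1).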



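The alternative explanation assumes the causal structure
of Fig. 5 with parameters

$$\begin{aligned}
P(C) &= \tfrac{3}{4}[0] + \tfrac{1}{4}[1], \\
P(B \mid C = 0) &= \tfrac{2}{3}[0] + \tfrac{1}{3}[1], \\
P(B \mid C = 1) &= [1], \\
P(A \mid B = 0, C = 0) &= \tfrac{1}{2}[0] + \tfrac{1}{2}[1], \\
P(A \mid B = 1, C) &= [C].
\end{aligned}$$

(We need not specify P(A|B = 0, C = 1) because P(B =
0, C = 1) = 0.) The joint distribution one obtains is
again that of Eq. (3.2).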
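
This claim can also be checked exactly. The sketch below assumes the parameter values P(C) = 3/4[0] + 1/4[1], P(B|C=0) = 2/3[0] + 1/3[1], P(B|C=1) = [1], P(A|B=0,C=0) = 1/2[0] + 1/2[1] and P(A|B=1,C) = [C], which are the values fixed by requiring that the factorisation P(C)P(B|C)P(A|B,C) reproduce Eq. (3.2). The variable names are ours.

```python
from fractions import Fraction as F

p_c = {0: F(3, 4), 1: F(1, 4)}
p_b_given_c = {0: {0: F(2, 3), 1: F(1, 3)}, 1: {1: F(1)}}


def p_a_given_bc(b, c):
    """Distribution of A given B = b and C = c, as a dict over values of A."""
    if b == 1:
        return {c: F(1)}                  # P(A | B=1, C) = [C]
    if c == 0:
        return {0: F(1, 2), 1: F(1, 2)}   # P(A | B=0, C=0) is uniform
    return {}                             # (B=0, C=1) never occurs


# Joint distribution P(C) P(B|C) P(A|B,C), with tuples ordered (a, b, c).
alternative_joint = {}
for c, pc in p_c.items():
    for b, pb in p_b_given_c[c].items():
        for a, pa in p_a_given_bc(b, c).items():
            alternative_joint[(a, b, c)] = pc * pb * pa
```

The resulting dictionary has exactly four entries, [000], [010], [100] and [111], each with weight 1/4, so the two causal structures cannot be told apart from this distribution alone.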



The difference between the two explanations becomes
clear when we vary the parameters. An example helps to
make all of this more explicit. If we change the
parameters in the first model, for instance to

… $[1]$,
… $[A \oplus B]$,

where $\oplus$ denotes addition modulo 2, then the joint
distribution is no longer of the form of Eq. (3.2), but it
is still true that A is independent of B, while A and C
are dependent, and B and C are dependent. On the 
other hand, modifications to the parameters in the sec- 
ond model do not preserve the pattern of dependences 
and independences among A, B and C. 

The first causal structure explains the pattern of statis- 
tical dependences and independences in a manner that is 
robust to changes in the parameters of the causal model, 
whereas the second causal structure does not. Causal 
discovery algorithms therefore favour the first model over 
the second. 
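
The contrast between the two structures can be made concrete with a small experiment. In the sketch below, the structure A → C ← B and the structure C → B, (B, C) → A are read off from the parameter lists given for the two explanations. The function names, the particular perturbed values and the choice to vary only P(C) in the second model are illustrative assumptions of ours.

```python
from fractions import Fraction as F
from itertools import product


def a_b_independent(joint):
    """Exact test of P(A,B) = P(A) P(B) for a joint over tuples (a, b, c)."""
    pab, pa, pb = {}, {}, {}
    for (a, b, c), p in joint.items():
        pab[(a, b)] = pab.get((a, b), 0) + p
        pa[a] = pa.get(a, 0) + p
        pb[b] = pb.get(b, 0) + p
    return all(pab.get((a, b), 0) == pa[a] * pb[b] for a in pa for b in pb)


def collider_model(p_a1, p_b1, p_c1_given_ab):
    """Structure A -> C <- B with arbitrary parameters."""
    joint = {}
    for a, b, c in product((0, 1), repeat=3):
        pa = p_a1 if a else 1 - p_a1
        pb = p_b1 if b else 1 - p_b1
        pc1 = p_c1_given_ab[(a, b)]
        joint[(a, b, c)] = pa * pb * (pc1 if c else 1 - pc1)
    return joint


def fine_tuned_model(p_c0):
    """Structure C -> B, (B, C) -> A; only P(C = 0) is varied."""
    p_c = {0: p_c0, 1: 1 - p_c0}
    p_b = {0: {0: F(2, 3), 1: F(1, 3)}, 1: {0: F(0), 1: F(1)}}
    joint = {}
    for a, b, c in product((0, 1), repeat=3):
        if b == 1:
            pa = F(1) if a == c else F(0)   # P(A | B=1, C) = [C]
        else:
            pa = F(1, 2)                    # irrelevant when (B=0, C=1): probability 0
        joint[(a, b, c)] = p_c[c] * p_b[c][b] * pa
    return joint


# Arbitrary (hypothetical) parameters for the first structure.
perturbed_natural = collider_model(
    F(1, 3), F(4, 5),
    {(0, 0): F(1, 7), (0, 1): F(2, 3), (1, 0): F(1, 2), (1, 1): F(9, 10)},
)
tuned = fine_tuned_model(F(3, 4))       # the value that reproduces Eq. (3.2)
detuned = fine_tuned_model(F(1, 2))     # a nearby, untuned value
```

In this sketch `a_b_independent` holds for `perturbed_natural` and for `tuned` but fails for `detuned`: in the first structure the independence of A and B survives any choice of parameters, whereas in the second it holds only at special parameter values.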

In the example we have used, all of the variables in 
the causal model were observed variables. In general 
(and especially in a quantum context), one might only 
observe a subset of the variables that are part of the 
causal model. Even in this case, however, one should pre- 
fer those causal models wherein the conditional indepen- 
dences in the probability distribution over the observed 
variables are stable to changes in the causal-statistical 
parameters. 

This is the main assumption of the causal discovery 
algorithms, usually called faithfulness [2] or stability [1].
It is the key assumption in our analysis, so we highlight 
it: 

Faithfulness: The probability distribution induced 
by a causal model M (over the variables in M or some 
subset thereof) is faithful if its conditional independences 
continue to hold for any variation of the causal-statistical 
parameters in M. 

In other words, causal discovery algorithms assume 
that any conditional independences in the observed 
statistics are not a result of fine-tuning of the causal- 
statistical parameters. All independences should be a 
consequence of the causal structure alone. For almost 
any probability density over the parameter space, the 
parameter choices that can explain the statistical depen- 
dences in question within the unnatural causal structures 
will be found to have measure zero. 

The second major assumption of causal discovery al- 
gorithms is an appeal to Occam's razor, an assumption 
that one should favour the most simple or most minimal 
model that explains the statistics. Again, it can be ap- 
plied both for the case where the observed variables are 
all the variables in the causal model, or the case where 
they are some subset thereof. 

A causal model M will be said to simulate another 
causal model M' on a set of variables V if for every choice 
of causal-statistical parameters on M', there is a choice 
of causal-statistical parameters on M such that M yields 
the same distribution over V as M' does. We can now 
define the assumption of minimality. 

Minimality: Given two causal models M and M'
that induce a given probability distribution over a set of
observed variables V_O (in general a subset of the variables
postulated by each causal model), if M' can simulate
M on V_O but M cannot simulate M' on V_O, then M is
preferred to M' as a causal explanation of the probability
distribution over V_O.

At first sight, it might seem odd to prefer M over M' 
given that M is consistent with fewer distributions over 
V than M' is. But the fact that M can explain less than 
M' implies that M is more falsifiable than M', and in the 
version of Occam's razor espoused by causal discovery al- 
gorithms, the degree of falsifiability is the figure of merit 
that one seeks to optimize. More falsifiable theories are to 
be preferred because, in Pearl's words, "they provide the 
scientist with less opportunities to overfit the data "hind- 
sightedly" and therefore command greater credibility if a 
fit is found" ([1], p. 49). It follows that a causal model is
deemed most simple if it has the least expressive power,
while still doing justice to the observed probability
distribution. Note that M might be preferred to M' as a
causal explanation of the probability distribution over V_O
even though M may require more latent variables and/or
more causal arrows than M'; "the preference for simplicity
[...] is gauged by the expressive power of a structure,
not by its syntactic description" ([1], p. 46). We will see
some examples of the consequences of the assumption of 
minimality shortly. 

It is worth remembering that causal discovery algo- 
rithms are fallible. They are best considered a heuris- 
tic, an inference to the best explanation. Indeed, Pearl 
likens the faithfulness assumption in causal discovery to 
the following kind of inference: you see a chair before 
you and infer that there is a single chair rather than 
two chairs positioned such that the one hides the other
([1], p. 48). The task of causal discovery can be understood
as "an inductive game that scientists play against
Nature" ([1], p. 42).



A. Example of causal discovery assuming no latent 
variables 

Variables that are not observed but which are causally 
relevant are called latent variables, or hidden variables. 
In this section, we assume that the observed variables 
are the only causally relevant variables, i.e. that there 
are no hidden variables. We look at a particular exam- 
ple of how faithfulness can help to determine candidate 
causal structures from a pattern of dependences in this 
case. The scheme is equivalent to the one introduced by 
Wermuth and Lauritzen [8]. 

Suppose one is interested in answering the question 
"Does smoking cause lung cancer?" For each member 
of a population of individuals, the value of a variable S 
is known, indicating whether the individual smoked or 
not, and the value of a variable C is known, indicating 
whether they developed cancer or not. Suppose a correla- 
tion between S and C is observed. Furthermore, suppose 
that one also has access to a third variable T, indicating 
whether the individual had tar in their lungs or not, and 
suppose that it is found that S and C are conditionally 
independent given T. In other words, after conditioning 



on whether or not there is tar in the lungs, smoking and 
lung cancer are no longer correlated. Finally, imagine 
that these three variables are assumed to be the only 
causally relevant ones (we will consider the alternative 
to this assumption further on). What causal structure is
natural given the observed conditional independence re- 
lation? Because we wish to make it very clear how these 
algorithms work, we will not simply specify what causal 
structure they return. Instead, we will look "under the 
hood" of these algorithms. 

We begin by considering every possible hypothesis 
about the causal ordering. A causal ordering of vari- 
ables is an ordering wherein causal influences can only 
propagate from one variable to another if the second is 
higher in the order than the first. 


FIG. 6: The most general DAG for the causal ordering 
S < T < C. 

For instance, consider the causal ordering S < T < C. 
The most general causal structure consistent with such 
an ordering is given in Fig. 6. To get a causal model, we
need to supplement this with conditional probabilities of
every variable given its parents, that is, P(S), P(T | S),
and P(C | T, S). The joint distribution that this model
defines is simply

P(S, T, C) = P(S)P(T | S)P(C | T, S).

Given that any distribution can be decomposed in this 
form, by choosing the conditional probabilities appropri- 
ately, we can model any joint distribution P(S,T,C). 
But now we make use of the additional information we 
have about the joint distribution, namely that
(S ⊥ C | T). This implies that we can take the parameters in
the causal model to be such that P(C | T, S) = P(C | T),
so that the joint distribution can be written as

P(S, T, C) = P(S)P(T | S)P(C | T),

and we can drop the causal arrow from S to C, so that
the underlying causal structure is simply given by Fig. 7.




FIG. 7: DAG that captures (S ⊥ C | T) for the causal
ordering S < T < C.



This simplified causal structure cannot generate an
arbitrary probability distribution, but it can generate one
wherein (S ⊥ C | T). It is a candidate for the true causal
structure.

One then simply repeats this procedure for every
possible choice of the causal ordering. For instance, for the
ordering C < T < S, the most general causal structure is
the one shown in Fig. 8. The decomposition of the joint
probability corresponding to this causal structure is

P(S, T, C) = P(C)P(T | C)P(S | C, T),

but the constraint (S ⊥ C | T) implies that one can substitute
P(S | C, T) = P(S | T) in the causal model. Therefore, by
the assumption of minimality, we drop the causal arrow
from C to S, yielding a causal structure of the form given
in Fig. 9. So this is another possible causal structure.

FIG. 8: The most general DAG for the causal ordering
C < T < S.

FIG. 9: DAG that captures (S ⊥ C | T) for the causal
ordering C < T < S.

Sometimes different causal orderings lead to the same
causal structure; for instance, the orderings T < S < C
and T < C < S both yield the structure given in Fig. 10.

FIG. 10: DAG that captures (S ⊥ C | T) for the causal
orderings T < S < C and T < C < S.

Other causal orderings, such as S < C < T and
C < S < T, are such that the conditional independence
constraint does not lead to any simplification of
the causal structure. For instance, for S < C < T,
the joint distribution decomposes as P(S, T, C) =
P(S)P(C | S)P(T | C, S), and none of the terms on the
right-hand side can be simplified by (S ⊥ C | T). These
two orderings lead to the two causal structures in Fig. 11.


FIG. 11: DAGs that capture (S ⊥ C | T) for the causal
orderings S < C < T (a) and C < S < T (b).

Therefore, in this example, the six possible causal
orderings have led to five candidates for the causal
structure, depicted in Figs. 7, 9, 10 and 11. However, the two
causal structures shown in Fig. 11 do not satisfy stability,
so only the other three are viable.
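The ordering-by-ordering procedure can be sketched in a few lines. This is a minimal illustration, not the published algorithm: the names `CI`, `indep`, `parents_for` and `dag_for` are hypothetical, the independence "oracle" is just a lookup in the single observed relation, and the pairwise check inside `parents_for` is a simplification that suffices for three variables.

```python
from itertools import combinations, permutations

# Observed conditional independences, stored as (pair, conditioning set).
# Only S and C given T is assumed here, as in the smoking example.
CI = {(frozenset({"S", "C"}), frozenset({"T"}))}

def indep(x, y, given):
    return (frozenset({x, y}), frozenset(given)) in CI

def parents_for(var, predecessors):
    """Smallest set of predecessors that screens var off from the remaining ones."""
    for size in range(len(predecessors) + 1):
        for pa in combinations(predecessors, size):
            rest = [v for v in predecessors if v not in pa]
            if all(indep(var, r, pa) for r in rest):
                return set(pa)
    return set(predecessors)

def dag_for(order):
    """Edges (parent, child) obtained by pruning the complete DAG for this ordering."""
    edges = set()
    for i, var in enumerate(order):
        for p in parents_for(var, order[:i]):
            edges.add((p, var))
    return frozenset(edges)

candidates = {}
for order in permutations("STC"):
    candidates.setdefault(dag_for(order), []).append("<".join(order))

for dag, orders in candidates.items():
    # A DAG that still has all three edges was not simplified by the constraint.
    note = "unsimplified" if len(dag) == 3 else "simplified"
    print(sorted(dag), orders, note)
```

Running this groups the six orderings into five distinct structures; the two that keep all three arrows are the ones for which the observed independence would have to come from the parameters rather than from the graph.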

Suppose finally that in addition to the information 
about conditional independence, one has information 
which rules out certain causal orderings. For instance, in 
the example we are considering, suppose one has the ad- 
ditional information that tar in the lungs always appears 
after a person has smoked, never before. It is then rea- 
sonable to rule out any causal structure that has T < S. 
This rules out Figs. 9 and 10. At the end, the only
candidate causal structure which is left is the one given in
Fig. 7, which says that smoking causes tar in the lungs
which causes lung cancer. 

Of course, it needn't be the case that these observed 
variables are the only ones that are causally relevant. For 
instance, there might be an unobserved genetic factor 
which predisposes people both to smoke and to develop 
lung cancer. Indeed, tobacco companies were quick to
point out the possibility of explaining the observed cor- 
relation between smoking and cancer in terms of such a 
genetic factor. So it is useful also to have causal discovery 
algorithms that allow for latent variables. 

Before moving on to algorithms that posit latent vari- 
ables, we pause to note that the algorithm described here 
is proven to be correct in the sense that if there exists a 
set of causal structures that are minimal and faithful to 
the observed correlations, then the algorithm will return 
these structures [8]. 

More efficient versions of this algorithm are described 
elsewhere, for instance, the Inductive Causation (IC)
algorithm described in Pearl [1], which is equivalent to
the SGS algorithm of Spirtes, Glymour and Scheines [2].
There have also been many proposals to further improve
the efficiency of these algorithms (see Refs. [1] and [5]
for details). These algorithms have been proven to be
correct in the sense that if there exist causal models that 
are minimal and faithful, then the algorithms will return 
them. 



B. Example of causal discovery allowing for latent 
variables 



Causal discovery in the case where one allows latent 
variables is more complicated. We begin by considering 
some of the consequences of the assumption of minimality 
for causal models with latent variables. 



First of all, it is clear that one needn't consider any 
causal models wherein a latent variable mediates a rela- 
tion between two observed variables, because the set of 
distributions over the observed variables that can be ex- 
plained by such a model is no greater than the set that 
can be explained by simply postulating a direct causal in- 
fluence between the observed variables. Similarly, posit- 
ing a latent variable that is a common effect of the ob- 
served variables does not change the distributions that 
can be supported on the observed variables. Latent vari- 
ables have nontrivial consequences for the observed dis- 
tribution only when they act as common causes of the 
observed variables. 

Consider the following suggestion for a causal explana- 
tion of the correlations among a set of observed variables: 
there are no causal influences among any of the observed 
variables, but there is a single latent variable that has a 
causal influence on each of them. By choosing the latent 
variable to take as many values as there are valuations of 
the observed variables, one can explain any correlation 
among the observed variables in this way. However, if 
there exists another causal model that can only reproduce 
a smaller set of possible correlations, while reproducing 
the observed correlations, then Occam's razor dictates 
that we should prefer the latter. Of course, one could 
imagine that further investigations (involving interven- 
tions, for instance) might vindicate the explanation that 
is less falsifiable over the one that is more falsifiable. This 
simply is another reminder that causal discovery algo- 
rithms are not infallible — they are heuristics for iden- 
tifying the most plausible causal explanations given the 
evidence. 

Now we come to the most subtle part of the causal 
discovery algorithms that posit latent variables. There is 
a difference between applying the criterion of minimality 
among a set of causal structures that are consistent with 
a given distribution over the observed variables and ap- 
plying the criterion of minimality among a set of causal 
structures that are consistent with a given set of
conditional independence relations over the observed variables.
As we've mentioned before, the algorithms described in
Refs. [1] and [2] look only at the CI relations and
consequently they follow the latter course. This choice is a
significant shortcoming of current causal discovery algo- 
rithms, but we will defer this criticism until the end of 
this section. 

For the moment, we simply explain the consequences 
of this choice. To do so, it is useful to divide the causal 
structures that are consistent with a given distribution 
over a set of observed variables into two sorts. The first 
kind is such that all the latent variables it posits are 
common causes for at most two of the observed vari- 
ables. We'll say that such a causal structure is limited 
to pairwise confounding. The other kind is unrestricted, 
so that more than two observed variables can be directly 
influenced by a single latent variable. 

It is possible to show [9] that for a given set of CI
relations among a set of observed variables, if a causal
model M generates those CI relations faithfully (that is,
as a consequence of the causal structure, rather than the 
causal parameters), then there is another causal model 
M' that achieves the same CI relations faithfully but 
which is limited to pairwise confounding. The assump- 
tion of minimality makes M' preferred to M. 

Therefore, if one is only applying the criterion of min- 
imality among a set of causal structures that are con- 
sistent with the CI relations among the observed vari- 
ables, then Occam's razor dictates that one need only 
look among causal models that are pairwise confound- 
ing. This is precisely what the standard causal discovery 
algorithms do. As such, one can make use of a simplified 
graphical language to express the set of causal structures 
that can be output by these algorithms. Rather than us- 
ing DAGs that include both the latent and the observed 
variables in the causal structure, it is convenient to use a 
graph which only includes the observed variables as nodes 
but uses a larger variety of edges among these nodes to 
specify the causal relation that might hold among the as- 
sociated variables. For instance, a double-headed arrow 
between variables X and Y signifies that there is a
common cause of X and Y (Fig. 12). An arrow that has a
circle rather than an arrowhead at one end represents
either a common cause or a direct causal influence or both
(Fig. 13). Finally, an undirected edge with a circle at
its head and tail represents any of the five possible ways
in which a pair of variables might be related (Fig. 14).
In this way, a set of causal structures that include latent
variables can be summarized in a single graph. Following
Pearl, we call such graphs patterns (see footnote 5).







FIG. 12: The interpretation of a bidirected edge in 
terms of a DAG. 



In order to infer which set of causal structures (includ- 
ing latent variables) is consistent with a given pattern, 
it is not sufficient to simply substitute for every undi- 
rected edge (or bi-directed edge or directed edge with 
decorated tail) all the possibilities consistent with that 
edge, as enumerated in Figs. 12, 13 and 14. One must
eliminate some of the combinations. The definition of a 
v-structure in a DAG is a head-to-head collision of two 
arrows on a node such that the parents do not exert any 
direct causal influence on one another. The prescription 



5 More precisely, the analogue of the particular graphs we consider 
here are Pearl's "marked patterns". These have also been called 
"partially oriented inducing path graphs" in SGS. We will follow 
the notational convention of SGS rather than those of Pearl when 
drawing such graphs. 
FIG. 13: The interpretation of a directed edge with a
circle at its tail in terms of DAGs.

FIG. 14: The interpretation of an undirected edge with
circles at head and tail in terms of DAGs.



for finding all the DAGs consistent with a pattern is to 
consider all the combinations of possibilities that do not 
create a new v-structure. 

The IC* algorithm described in Pearl [1] (which is
equivalent to the Causal Inference (CI) algorithm
described in SGS [2]) takes conditional independence
relations as input and returns a pattern. This algorithm is
correct in the sense that if there exist causal structures
that are faithful to the observed CI relations, then the
algorithm will return the minimal structures within this
set. We will not review the details of the algorithm here, 
but we will apply it to a simple example to get a feeling 
for how it works. 

Consider the smoking example again, where the
observed variables S, T and C are found to satisfy
(S ⊥ C | T). The pattern returned by the IC* algorithm in
this case is shown in Fig. 15.

FIG. 15: Output pattern of IC* algorithm for input
(S ⊥ C | T).

For each undirected edge in this pattern, there are five 
possibilities in the DAG for what connection holds
between the nodes, as displayed in Fig. 14. In Fig. 16 we
display all twenty-five combinations of such possibilities.
We have also shaded out each of the combinations that 
introduces a new v-structure - these combinations are 
not candidates for the causal structure according to the 
IC* algorithm. Hence, the nine causal structures that 
remain are the candidates returned by IC*. 
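The count of nine can be reproduced by enumeration. The sketch below is a hedged illustration: the encoding of the five edge interpretations (including the label `L_x` for a latent common cause private to one pair) is an assumption of this example, and "introduces a new v-structure" is modelled as "both edges contribute an arrowhead into T".

```python
from itertools import product

def options(x):
    """The five ways the pair (x, T) may be related; values are lists of (tail, head) arrows."""
    lat = "L_" + x  # hypothetical name for a latent common cause of x and T
    return {
        f"{x}->T": [(x, "T")],
        f"T->{x}": [("T", x)],
        f"{x}<-L->T": [(lat, x), (lat, "T")],
        f"{x}->T plus {x}<-L->T": [(x, "T"), (lat, x), (lat, "T")],
        f"T->{x} plus {x}<-L->T": [("T", x), (lat, x), (lat, "T")],
    }

def has_head_at_T(arrows):
    return any(head == "T" for _, head in arrows)

all_pairs = list(product(options("S").items(), options("C").items()))

# Keep a combination unless arrows from the S side and from the C side collide on T.
kept = [
    (s_name, c_name)
    for (s_name, s_arrows), (c_name, c_arrows) in all_pairs
    if not (has_head_at_T(s_arrows) and has_head_at_T(c_arrows))
]

print(len(all_pairs), len(kept))
for s_name, c_name in kept:
    print(s_name, "|", c_name)
```

Four of the five interpretations of each edge put an arrowhead on T, so sixteen of the twenty-five combinations collide there and nine survive.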

How does this answer embody the principles of causal 
discovery? First, the fact that the pattern admits only 
pairwise confounding is a consequence of the particu- 
lar sort of minimality assumption going into these algo- 
rithms, as we discussed at the beginning of this section. 
This is the reason that we do not find in the output of 
the algorithm any latent variable that is a common cause 
of all three variables S, T and C. 

Now consider the question of why there is neither a 
direct causal influence between S and C nor a latent 
variable that acts as a common cause for the pair. The 
answer is simply that if either of these sorts of influences 
were acting, then we would not find (S ⊥ C | T);
learning S would teach us something about C even though T
is known. In the context of our example, this eliminates 
the possibility put forward by the tobacco companies of a 
hypothetical genetic factor that both predisposes people 
to smoke and to get lung cancer. 

We need not consider the cases where there is also no 
connection between S and T nor the cases where there is 
also no connection between T and C because by
assumption (S ⊥ C | T) is the only CI relation and therefore
(S ⊥̸ T) and (T ⊥̸ C).

It follows that the twenty-five structures displayed in
Fig. 16 are the only possibilities that remain among all
possible causal structures, so to explain why the output
of the algorithm is justified we need only explain why we
should eliminate those that introduce a new v-structure.
First note that if one conditions on a variable that is the
common effect of two other variables, then we expect a
dependence between those variables (for instance, in
digital logic, knowing that the output of an AND gate is
0 implies that the two inputs cannot both be 1).
Therefore for each causal structure that includes a v-structure
on T, we would expect that conditioning on T induces
a dependence between the roots of the v-structure, and
because one of these roots is always correlated with S
and the other with C, this would imply a dependence
between S and C, contradicting the fact that (S ⊥ C | T).
Alternatively, we can infer that a causal structure
including a v-structure on T contradicts the relation (S ⊥ C | T)
using the d-separation criterion.
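The AND-gate remark can be checked directly. The following small sketch assumes two independent fair input bits (an assumption made for illustration) and compares the inputs' joint behaviour before and after conditioning on the output being 0.

```python
from fractions import Fraction as F
from itertools import product

# Two independent fair bits a, b feed an AND gate with output c.
joint = {(a, b, a & b): F(1, 4) for a, b in product((0, 1), repeat=2)}

def prob(pred):
    """Probability of the event described by pred(a, b, c)."""
    return sum(p for key, p in joint.items() if pred(*key))

# Marginally, the inputs are independent.
p_a1 = prob(lambda a, b, c: a == 1)
p_b1 = prob(lambda a, b, c: b == 1)
p_a1_b1 = prob(lambda a, b, c: a == 1 and b == 1)
print(p_a1_b1 == p_a1 * p_b1)

# Conditioned on the output being 0, they are not.
p_c0 = prob(lambda a, b, c: c == 0)
p_a1_c0 = prob(lambda a, b, c: a == 1 and c == 0) / p_c0
p_b1_c0 = prob(lambda a, b, c: b == 1 and c == 0) / p_c0
p_a1_b1_c0 = prob(lambda a, b, c: a == 1 and b == 1 and c == 0) / p_c0
print(p_a1_b1_c0 == p_a1_c0 * p_b1_c0)
```

Given output 0, each input equals 1 with probability 1/3, yet both equal 1 with probability 0 rather than 1/9, which is the induced dependence the argument relies on.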

What does this imply about whether smoking causes 
lung cancer? Suppose that we make use of the same
additional information as we considered in Sec. III A, namely,
that tar in the lungs is always found to occur after
smoking, never before. We can then eliminate all causal
structures with an arrow from T to S. What remains are the
three options in Fig. 17. They are: (i) smoking causes
tar in the lungs which causes cancer, (ii) there is a latent
variable that is a common cause of smoking and having 
tar in the lungs, and (iii) both mechanisms are in play. 
If option (ii) holds then smoking is not a cause of cancer 
and, unlike the hypothesis of a genetic factor that predis- 
poses people both to smoke and to develop lung cancer, 
it is consistent with the observation that tar screens off 
smoking from cancer. Of course, this hypothesis remains 
implausible if one cannot identify (or imagine) any factor 
that screens off smoking from tar in the lungs. 

We previously highlighted the fact that the causal
discovery algorithms of Refs. [1] and [2] apply the principle
of minimality within the set of causal structures that are 
consistent with the CI relations in the observed distri- 
bution, not within the set of those that are consistent 
with the observed distribution itself. This can be a prob- 
lem because these two sets of causal structures can be 
different [3]. 

It is best to illustrate this with an example. Consider 
the case of a triple of observed variables, X, Y and Z. 
We will compare two causal models. The first posits a
latent variable λ which has a direct causal influence on all
three observed variables. The second posits three latent
variables, λ, μ and ν, each of which has a direct causal
influence on a distinct pair of observed variables (see
footnote 6). The two models are illustrated in Fig. 18.

The two structures imply precisely the same set of CI 
relations among the observed variables, namely, the null 
set. However, there are distributions over the triple of 
observed variables that are only consistent with the first 
model and not the second. For instance, a joint
distribution wherein the three observed variables X, Y and
Z are close to perfectly correlated (see footnote 7) cannot be generated



6 This causal scenario has also been considered in the context of a
discussion of nonclassical correlations in Ref. [10].

7 We cannot take the case where they are perfectly correlated because 
we want our example to be of a distribution that is faithful to the 
first causal structure and perfect correlation would imply that any 
two variables are conditionally independent given the third. 
FIG. 16: The causal structures returned by the IC* algorithm when the input is a distribution over observed
variables S, T and C with (S ⊥ C | T). Those that introduce a new v-structure are shaded out.




FIG. 17: The causal structures that remain if the 
ordering S < T is assumed. 



from the second causal structure for any choice of causal 
parameters. Therefore, if this is the distribution one has 
observed, then the second causal structure is not a can- 
didate for the underlying causal model. However, the CI 
relations one observes for such a distribution are consis- 
tent with the second causal structure. So if the input to 
one's causal discovery algorithm is limited to these rela- 
tions, the algorithm can return a causal structure that 
is inconsistent with the observed distribution. Indeed, 
Occam's razor prefers the second structure to the first, 
so the causal algorithms would output a causal structure 
that is actually inconsistent with the observed distribu- 
tion. 

We will see that this sort of failure mode of the causal
discovery algorithms is exactly what occurs when one
applies them to correlations that violate a Bell inequality.

FIG. 18: Two candidate causal structures for explaining
correlations between X, Y and Z using latent variables.



IV. APPLYING CAUSAL DISCOVERY 
ALGORITHMS TO QUANTUM CORRELATIONS 

We now turn to the question of what these algorithms 
tell us about quantum correlations. We consider only 
Bell-type experiments involving two systems, two pos- 
sible settings for each measurement and two possible 
outcomes for each measurement. Let S and T be the 
bit-valued variables that specify which measurement was 
performed on the left and right wings of the experiment 
respectively, and let A and B be the bit-valued variables 
that specify the outcomes of the measurements on the 
left and right wings respectively. 
Bell's theorem derives constraints on P(AB | ST) from
assumptions about the causal structure [11]. These
assumptions (which Bell justified by appeal to the
space-like separation of the two wings of the experiment and
the impossibility of superluminal causal influences) are
that A is the joint effect of the setting variable S and a
common cause variable λ, while B is the joint effect of
the setting variable T and λ. The causal structure
corresponding to this assumption is presented in Fig. 19.




FIG. 19: The causal structure corresponding to Bell's 
notion of local causality. 

This structure implies the following conditional
independence relations,

(A ⊥ BT | Sλ) and (B ⊥ AS | Tλ).

Bell called his assumption local causality and formalized
it in terms of these conditional independences. These
in turn imply that P(AB | STλ) = P(A | Sλ)P(B | Tλ),
which is known as factorizability. From this condition,
together with the assumption that there are no
correlations between the settings and the hidden variables,

(S ⊥ Tλ) and (T ⊥ Sλ),

one can infer that P(AB | ST) must satisfy the Bell
inequalities [11, 12]. Bell's assumption about the causal
structure also implies no superluminal signalling:

No-signalling: (A ⊥ T | S) and (B ⊥ S | T). (4.1)

The fact that quantum correlations can violate Bell
inequalities shows that they cannot be explained using
the causal structure of Fig. 19.
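One way to see the content of a Bell inequality for this causal structure is to enumerate the deterministic local strategies, in which the outcome on each wing is a fixed function of that wing's setting. The sketch below uses the CHSH form of the inequality with an assumed convention: outcomes should agree unless both settings equal 1, in which case they should disagree. It relies on the standard fact that a factorizable model with settings independent of the hidden variable is a probabilistic mixture of such strategies, so its average score cannot exceed the best deterministic one. The function name is hypothetical.

```python
from itertools import product

def chsh_score(f, g):
    """Number of setting pairs (s, t) whose outcomes f[s], g[t] meet the CHSH target."""
    return sum((f[s] ^ g[t]) == (s & t) for s, t in product((0, 1), repeat=2))

# f[s] is the outcome A for setting S = s; g[t] is the outcome B for setting T = t.
strategies = list(product((0, 1), repeat=2))
best = max(chsh_score(f, g) for f in strategies for g in strategies)

# Fraction of the four setting pairs that the best local strategy can satisfy.
print(best / 4)  # 0.75
```

No local strategy satisfies all four conditions, so with uniformly chosen settings the average success probability of any model with the structure of local causality is at most 0.75.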

We will now consider the inverse problem to the one 
considered by Bell. Rather than attempting to infer 
constraints on correlations from assumptions about the 
causal structure, we will attempt to infer conclusions 
about possible causal structures from the nature of the 
correlations implied by quantum theory. This is the sort 
of problem that the causal discovery algorithms were de- 
signed to solve. 

We will contrast two examples of quantum correla- 
tions: one which violates the Bell inequalities and the 
other which satisfies the Bell inequalities. 

For the latter, we will take a version of the
Einstein-Podolsky-Rosen (EPR) experiment [12] in terms of qubits
(first proposed by Bohm for spin-1/2 systems [14]). The
pair is prepared in the maximally entangled state

$$|\Psi\rangle = \frac{1}{\sqrt{2}}\left(|{+z}\rangle|{+z}\rangle + |{-z}\rangle|{-z}\rangle\right) \qquad (4.2)$$

where |±z⟩ are the eigenstates of spin along the z axis.
On each wing, the two choices of measurement are
between the same pair of mutually unbiased bases, for
instance, measurements of spin along the z or x axes, as
illustrated in Fig. 20. In this case, if the same
measurement is made on both wings (both z or both x), one
sees perfect correlation between the outcomes, while if
different measurements are made (z on one and x on
the other), then one sees no correlation between the
outcomes. It is well known that these sorts of correlations
do not violate any Bell inequality, which is to say that
they can be explained by a locally causal model.
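A simple locally causal model reproducing this pattern can be written down explicitly. In the sketch below, which is an illustrative construction rather than one taken from the paper, the hidden variable is a pair of fair bits shared by both wings, and each wing outputs the bit indexed by its own setting (0 for z, 1 for x).

```python
from fractions import Fraction as F
from itertools import product

def p_equal(s, t):
    """P(A = B | S = s, T = t) when A = lam[s] and B = lam[t] for a shared uniform lam."""
    hits = sum(lam[s] == lam[t] for lam in product((0, 1), repeat=2))
    return F(hits, 4)

for s, t in product((0, 1), repeat=2):
    print((s, t), p_equal(s, t))
```

Equal settings give agreement with probability 1 and unequal settings give agreement with probability 1/2, which is the perfect-correlation and no-correlation pattern described above.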





FIG. 20: Measurement axes for generating EPR
correlations given the quantum state of Eq. (4.2).
(a) Left wing measurement. (b) Right wing measurement.



The other sort of correlations we consider are those
exhibited in the Clauser-Horne-Shimony-Holt (CHSH)
experiment. We can take the pair of spins to be
prepared in the same maximally entangled state |Ψ⟩ as for
the EPR scenario, and the pair of measurements on the
left wing to also be of spin along the z or x axes.
However, on the right wing, the pair of possible
measurements are of spin along the (z + x)/√2 axis or along the
(z − x)/√2 axis, as indicated in Fig. 21. In this case,
one finds that the probability of correlation for the cases
(S, T) = (0, 0), (1, 0) and (0, 1) is equal to the probability
of anticorrelation for the case (S, T) = (1, 1) and has the
value 1/2 + 1/(2√2) ≈ 0.85.
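These probabilities can be reproduced with a few lines of arithmetic. The sketch below assumes a particular labelling that the text does not spell out: S = 0, 1 selects the z, x axis on the left, and T = 0, 1 selects the (z + x)/√2, (z − x)/√2 axis on the right. Axes in the x-z plane are described by their angle from z, and the function names are hypothetical.

```python
import math

def basis(theta):
    """Spin-up and spin-down eigenvectors (in the z basis) for an axis at angle theta from z."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [(c, s), (-s, c)]

def p_outcomes(theta_a, theta_b):
    """P(A = a, B = b) for the state (|+z,+z> + |-z,-z>)/sqrt(2); outcome 0 means spin up."""
    ua, ub = basis(theta_a), basis(theta_b)
    r = 1 / math.sqrt(2)
    return {
        (a, b): (r * ua[a][0] * ub[b][0] + r * ua[a][1] * ub[b][1]) ** 2
        for a in (0, 1) for b in (0, 1)
    }

LEFT = {0: 0.0, 1: math.pi / 2}            # assumed: S = 0 is z, S = 1 is x
RIGHT = {0: math.pi / 4, 1: -math.pi / 4}  # assumed: T = 0 is (z+x)/sqrt(2), T = 1 is (z-x)/sqrt(2)

def p_correlated(s, t):
    d = p_outcomes(LEFT[s], RIGHT[t])
    return d[(0, 0)] + d[(1, 1)]

for s in (0, 1):
    for t in (0, 1):
        print((s, t), round(p_correlated(s, t), 4))
```

With this labelling the outcomes agree with probability about 0.85 for three of the setting pairs and disagree with the same probability for (1, 1), which exceeds the value 0.75 that bounds locally causal models in the corresponding CHSH game.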

The input to the standard causal discovery algorithms 
is limited to conditional independence relations, so we
begin by computing the conditional independences that 
hold for the EPR and CHSH experiments. Rather than 
specifying an exhaustive list, we provide a generating set 
(the rest can be obtained by applying the semi-graphoid 
axioms). They are: 



EPR: (S ⊥ T), (A ⊥ T | S), (B ⊥ S | T),
(AB ⊥ S), (AB ⊥ T).

CHSH: (S ⊥ T), (A ⊥ T | S), (B ⊥ S | T).



Consider the conditions (AB ⊥ S) and (AB ⊥ T) seen
in the EPR experiment. These imply, by decomposition,
that (A ⊥ S) and (B ⊥ T); the outcome on a wing is
independent of the setting on that wing. While true, this
independence is not representative of the causal
structure. Indeed, it only holds because of the degeneracy
of the Schmidt coefficients in the maximally entangled
state. If we instead consider the state

$$|\psi\rangle = \sqrt{p}\,|{+z}\rangle|{+z}\rangle + \sqrt{1-p}\,|{-z}\rangle|{-z}\rangle,$$

where p ≠ 1/2, then (A ⊥̸ S) and (B ⊥̸ T). Because it
is intuitively clear that the choice of measurement does
have a causal influence on the outcome, the
independences (A ⊥ S) and (B ⊥ T) are pathological in the
context of the causal discovery algorithms. Given that
the EPR (CHSH) experiment with a state that is close
to maximally entangled still satisfies (violates) the Bell
inequalities, we consider these states instead. (If one likes,
p may be taken to be arbitrarily close to 1/2.)

FIG. 21: Measurement axes for generating CHSH
correlations given the quantum state |Ψ⟩ of Eq. (4.2).
(a) Left wing measurement. (b) Right wing measurement.

We then get the following generating sets of indepen- 
dence relations, 

EPR: (S ⊥ T), (A ⊥ T | S), (B ⊥ S | T),
CHSH: (S ⊥ T), (A ⊥ T | S), (B ⊥ S | T),

where (S ⊥ T) asserts the independence of the settings,
and (A ⊥ T | S) and (B ⊥ S | T) are the no-signalling
conditions. The critical point is that the set of
independences is the same for the EPR and the CHSH
experiments. Because the input to the causal discovery
algorithms that we consider is limited to conditional in- 
dependence relations, it follows that whatever causal con- 
clusions these algorithms draw, they will draw the same 
causal conclusions about the EPR experiment as they 
do about the CHSH experiment. And yet, from the fact 
that the EPR correlations satisfy the Bell inequalities, we 
know that they can be explained by local causes while 
from the fact that the CHSH correlations violate a Bell 
inequality, we know that they cannot be so explained. 
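
The point can be checked numerically. The following is a minimal sketch, not the construction used in the text: it assumes the partially entangled state with real Schmidt coefficients sqrt(p) and sqrt(1-p), measurements in the x-z plane, and illustrative axis choices (the axes of Fig. 21 are not reproduced here). The function names `conditional`, `no_signalling` and `chsh_value` are hypothetical. The independence of the settings is taken as an assumption about how S and T are chosen, so only the two no-signalling relations are tested.

```python
import math
from itertools import product

def projector(theta, outcome):
    """2x2 projector onto the `outcome` (+1 or -1) eigenspace of cos(theta) Z + sin(theta) X."""
    c, s = math.cos(theta), math.sin(theta)
    return [[(1 + outcome * c) / 2, outcome * s / 2],
            [outcome * s / 2, (1 - outcome * c) / 2]]

def conditional(p, left_axes, right_axes):
    """table[a, b, s, t] = P(A=a, B=b | S=s, T=t) for sqrt(p)|+z,+z> + sqrt(1-p)|-z,-z>."""
    table = {}
    for (s, th_s), (t, th_t) in product(enumerate(left_axes), enumerate(right_axes)):
        for a, b in product((0, 1), repeat=2):
            pa = projector(th_s, 1 - 2 * a)  # a = 0 means outcome +1, a = 1 means outcome -1
            pb = projector(th_t, 1 - 2 * b)
            # <psi| pa (x) pb |psi> for a real state supported on |00> and |11> only
            table[a, b, s, t] = (p * pa[0][0] * pb[0][0]
                                 + (1 - p) * pa[1][1] * pb[1][1]
                                 + 2 * math.sqrt(p * (1 - p)) * pa[0][1] * pb[0][1])
    return table

def no_signalling(table, tol=1e-12):
    """Check that A is independent of T given S, and B is independent of S given T."""
    for a, s in product((0, 1), repeat=2):
        m = [sum(table[a, b, s, t] for b in (0, 1)) for t in (0, 1)]
        if abs(m[0] - m[1]) > tol:
            return False
    for b, t in product((0, 1), repeat=2):
        m = [sum(table[a, b, s, t] for a in (0, 1)) for s in (0, 1)]
        if abs(m[0] - m[1]) > tol:
            return False
    return True

def chsh_value(table):
    """Largest |E00 + E01 + E10 + E11 - 2 E_st| over the position of the minus sign."""
    corr = {(s, t): sum((-1) ** (a + b) * table[a, b, s, t]
                        for a, b in product((0, 1), repeat=2))
            for s, t in product((0, 1), repeat=2)}
    total = sum(corr.values())
    return max(abs(total - 2 * e) for e in corr.values())

p = 0.45  # hypothetical value close to, but not equal to, 1/2
epr = conditional(p, (0.0, math.pi / 2), (0.0, math.pi / 2))
chsh = conditional(p, (0.0, math.pi / 2), (math.pi / 4, 3 * math.pi / 4))
print(no_signalling(epr), no_signalling(chsh))
print(chsh_value(epr), chsh_value(chsh))
```

Under these assumptions both tables pass the same independence checks, yet only the second exceeds the CHSH bound of 2. The difference between them lies entirely in the strength of the correlations, which the independence-based input never records.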

So the conclusion is that standard causal discovery 
algorithms (based on conditional independences) cannot 
possibly do justice to Bell's theorem. Independences sim- 
ply do not provide enough information. One needs a 
causal discovery algorithm that looks at the strength of 
correlations to reproduce the conclusions of Bell's theorem. 



Despite the inability of the standard causal discov- 
ery algorithms to distinguish correlations that violate 
the Bell inequalities from those that satisfy them, it is 
nonetheless interesting to see what happens when one 
applies the algorithms to the set of independences we 
found for the EPR and CHSH experiments. We will re- 
fer to these as nontrivial no-signalling correlations (they 
are deemed nontrivial because they predict correlation 
between the outcomes for some choices of the settings). 

In applying the causal discovery algorithms, we will 
assume that the setting variable on one wing is a cause 
of the outcome variable on that wing, that is, we assume 
that S is a cause of A and that T is a cause of B. This is 
presumably uncontroversial. The assumption that there 
are no causal cycles then implies that there can be no 
causal influence from A to S, nor from B to T. Nonethe- 
less, we are still permitting influences from the outcome 
on one wing to the setting on the other, although, as we 
will see, the causal discovery algorithms will rule against 
such influences. 



A. No latent variables 

It is instructive to consider the causal structure that 
arises for a single representative causal ordering of the 
variables. We take S < T < A < B. Then, the most 
general causal structure is illustrated in Fig. 22, and 
the most general joint distribution for this ordering is of 
the form 

$$P(S,T,A,B) = P(S)\,P(T \mid S)\,P(A \mid S,T)\,P(B \mid S,T,A).$$




FIG. 22: The most general causal structure for the 
causal ordering S < T < A < B, assuming no hidden 
variables. 

The independence $(S \perp T)$ implies that $P(T \mid S) = P(T)$, 
and the independence $(A \perp T \mid S)$ implies that 
$P(A \mid S,T) = P(A \mid S)$. The independence $(B \perp S \mid T)$ has 
no nontrivial implications for this causal ordering, hence 
the term $P(B \mid S,T,A)$ cannot be simplified. From these 
CI relations it follows that the joint distribution can be 
written as 

$$P(S,T,A,B) = P(S)\,P(T)\,P(A \mid S)\,P(B \mid S,T,A),$$



which corresponds to the causal structure in Fig. 23a. If 
we change the ordering of variables so that B precedes A, 
then by a similar argument, we obtain the causal structure 
in Fig. 23b. For every other possible causal ordering 
consistent with our assumption that S < A and T < B, 
we also obtain one of the causal structures of Fig. 23. 

FIG. 23: Possible causal structures for no-signalling 
correlations, assuming no hidden variables, for causal 
orderings S < T < A < B, T < S < A < B, and 
S < A < T < B (23a), and for causal orderings 
S < T < B < A, T < S < B < A, and T < B < S < A (23b). 

Consider the causal structure in Fig. 23a. Although it 
faithfully captures $(S \perp T)$ and $(A \perp T \mid S)$, it does not 
faithfully capture $(B \perp S \mid T)$. The only way to explain 
the independence $(B \perp S \mid T)$ within this causal model is 
by fine-tuning of the causal parameters in the model, for 
instance, if the parameters defining $P(B \mid S,T,A)$ are not 
independent of those defining $P(A \mid S)$. A similar problem 
arises for the causal structure in Fig. 23b. It follows 
that in the case of no latent variables no causal structure 
can satisfy minimality and faithfulness for the conditional 
independences of nontrivial no-signalling correlations. 
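
The role of fine-tuning here can be illustrated with a small numerical sketch. It assumes binary variables and the structure with arrows S->A, S->B, T->B and A->B described above. The helper names are hypothetical, and the fine-tuned parameters are a toy choice (A uniform, B equal to A XOR (S AND T)) rather than the quantum parameters. Randomly drawn parameters generically let S influence the statistics of B at fixed T, whereas the tuned ones do not.

```python
import random
from itertools import product

def p_b_given_st(p_a_given_s, p_b_given_sta, s, t):
    """P(B=1 | S=s, T=t) = sum_a P(a | s) P(B=1 | s, t, a)."""
    return sum((p_a_given_s[s] if a else 1 - p_a_given_s[s]) * p_b_given_sta[s, t, a]
               for a in (0, 1))

def signalling_gap(p_a_given_s, p_b_given_sta):
    """Largest change in P(B=1 | s, t) when s is flipped at fixed t."""
    return max(abs(p_b_given_st(p_a_given_s, p_b_given_sta, 0, t)
                   - p_b_given_st(p_a_given_s, p_b_given_sta, 1, t))
               for t in (0, 1))

rng = random.Random(0)
generic_gaps = []
for _ in range(200):
    p_a = {s: rng.random() for s in (0, 1)}                          # P(A=1 | s)
    p_b = {key: rng.random() for key in product((0, 1), repeat=3)}   # P(B=1 | s, t, a)
    generic_gaps.append(signalling_gap(p_a, p_b))

# Fine-tuned toy parameters: A is uniform and B = A XOR (S AND T).
tuned_a = {0: 0.5, 1: 0.5}
tuned_b = {(s, t, a): float(a ^ (s & t)) for s, t, a in product((0, 1), repeat=3)}
tuned_gap = signalling_gap(tuned_a, tuned_b)

print(min(generic_gaps), tuned_gap)
```

In this sketch the independence of B from S given T survives only because the uniform distribution of A exactly masks the direct arrow from S to B; perturbing either table restores the dependence.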

Note that if, instead of applying the Wermuth- 
Lauritzen algorithm to the nontrivial no-signalling cor- 
relations, one applies the IC algorithm pQ, equivalently 
the SGS algorithm [3J, one finds that it returns a graph 
that is not a valid causal structure, signalling a failure 
of the algorithm. This is what one would expect given 
that the algorithm only promises to return a valid causal 
structure if there exists one that is minimal and faithful 
to the correlations, and in this case, there is not. 

There is an interesting lesson here for the foundations 
of quantum theory. Long before Bell's work, Einstein had 
pointed out that if one did not assume hidden variables, 
then one could only explain the EPR correlations by 
positing superluminal causes. This argument was made 
in his comments at the 1927 Solvay conference [TS] (See 
Refs. [IS] and [3] for more concerning Einstein's argu- 
ments on completeness and locality.) One can easily cast 
Einstein's argument into the mold of causal structures 
as follows. If we allow the quantum state $\psi$, considered 
as a classical variable, as the only common cause, then 
the assumption of no superluminal causes implies that 
$P(A,B \mid S,T,\psi) = P(A \mid S,\psi)\,P(B \mid T,\psi)$, and given that $\psi$ 
is fixed in the experiment (it is a variable which only takes 
one possible value), this implies that A and B should be 
uncorrelated, in contradiction with the EPR correlations. 
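
The last step can be written out explicitly. The following is a sketch under two assumptions that are not stated in the argument itself: the common cause is statistically independent of the settings, and $\psi_0$ (a label introduced here) denotes the single value it takes.

```latex
\begin{align}
P(A,B \mid S,T) &= \sum_{\psi} P(\psi)\, P(A \mid S,\psi)\, P(B \mid T,\psi) \\
                &= P(A \mid S,\psi_0)\, P(B \mid T,\psi_0)
                   \qquad \text{since } P(\psi) = \delta_{\psi,\psi_0} .
\end{align}
```

The right-hand side is a product of a function of (A, S) and a function of (B, T), so the outcomes are independent for every choice of settings.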

But what the result of our analysis shows is that Ein- 
stein failed to explicitly note another mysterious feature 
of the EPR correlations, namely, that even if one was 
willing to countenance superluminal causes in an attempt 
to explain the EPR correlations without recourse to hidden 
variables, ensuring that these superluminal causes 
cannot be used to send superluminal signals implies that 
there must be fine-tuning in the underlying causal model. 



B. Latent variables allowed 

If one simply inputs the independences of nontriv- 
ial no-signalling correlations into the IC* algorithm of 
Ref. [T], one obtains the causal diagram illustrated in 
Fig. 24 as output. 





FIG. 24: The output pattern of the IC* algorithm when 
applied to nontrivial no-signalling correlations. 

Recall that the arrows with an empty circle at their 
tail imply that one can have either a direct causal link 
or a common cause. If one believes that the settings at 
each wing are freely chosen, then one is inclined to think 
that either the setting variables S and T should be direct 
causes of A and B respectively, or that if they are not, 
then it is the common cause for A and S and the common 
cause for B and T that is freely chosen. In this case, we 
could lump the common causes into the definition of the 
setting variables without loss of generality. 

Besides this caveat about the causal relation between 
S and A and between T and B, the causal structures 
consistent with the pattern that the IC* algorithm has 
returned are precisely those that capture Bell's notion of 
local causality, illustrated in Fig. 19. 

Recall that the CHSH correlations are an instance 
of nontrivial no-signalling correlations that violate the 
Bell inequalities. Therefore, the IC* algorithm is claim- 
ing that Bell-inequality-violating correlations can be ex- 
plained by a locally causal model. However from Bell's 
theorem we know that this claim is mistaken. Therefore, 
the IC* algorithm comes to a causal conclusion that is 
incorrect. 

Of course, we already pointed out in Sec. IV that 
the input of the IC* algorithm cannot distinguish Bell- 
inequality-violating from Bell-inequality-satisfying corre- 
lations. The independences we have fed into the al- 
gorithm also hold for EPR correlations. Consequently, 
had it returned the conclusion that the nontrivial no- 
signalling correlations cannot be explained by the causal 
structure of Fig. 19, it would have also reached an incorrect 
causal conclusion because the EPR correlations can 
be explained by such a model. 

So we reiterate our conclusion from Sec. IV that causal 
discovery algorithms which look only at independences 
are inadequate to the task of establishing whether or not 
correlations can be explained by a locally causal model. 
We require better algorithms that also take into account 
the strengths of the correlations. 

From our brief discussion in Sec. III B of the shortcomings 
of causal discovery algorithms with latent variables 
we can also see why the algorithms have reached an incor- 
rect conclusion. The problem is that a causal structure 
with latent variables that reproduces the CI relations of 
a given distribution might not be capable of reproducing 
the distribution itself. In particular, the causal structure 
of Fig. 24 reproduces the CI relations of the distribution 
P(A,B,S,T) defined by the CHSH experiment, but it 
cannot reproduce the distribution itself. 



C. Some proposed causal explanations of quantum 
correlations 

We now apply the ideas behind causal discovery algo- 
rithms to a few of the existing proposals for providing 
a causal explanation of Bell-inequality-violating correla- 
tions. We consider three: superluminal causation, su- 
perdeterminism, and retrocausation. 




FIG. 25: Examples of causal structures that posit 
superluminal causal influences to explain Bell 
correlations. 



1. Superluminal causation 

One option for explaining Bell correlations causally is 
to assume that there are some superluminal causes, for 
instance, a causal influence from the outcome on one wing 
to the outcome on the other, or from the setting on one 
wing to the outcome on the other, or both. In the most 
general case one allows hidden variables that can causally 
influence the measurement outcomes. The possibilities 
are illustrated in Fig. 25. 

But the same problem arises for these sorts of causal 
explanations of Bell-inequality violations as arises for the 
causal explanations without hidden variables that were 
discussed in Sec. IV A. Given the superluminal causes 
from one wing to the other, the only way to explain the 
lack of superluminal signals is through a fine-tuning of 
the causal parameters. 



For instance, in Fig. 25c the correlations set up between 
S and B along the direct causal path could cancel 
with those set up by the causal path through A. (The 
path through $\lambda$ cannot set up correlations between S and 
B because there is a collider on A in this path and we 
are not conditioning on A.) Such a cancellation requires 
fine-tuning of the parameters of the model. 

To salvage no-signalling for the causal structure of 
Fig. 25a we need a different sort of fine-tuning (a similar 
sort of fine-tuning mechanism can also be used for the 
causal structure of Fig. 25b). For instance, it could be 
that $\lambda = (\lambda_1, \lambda_2)$, where $\lambda_1$ is a binary variable that is 
uniformly distributed, and that B is a function of $S \oplus \lambda_1$, 
T and $\lambda_2$. In this case, we can ensure that $(B \perp S \mid T)$ by 
virtue of the special distribution on $\lambda_1$, which is a kind 
of fine-tuning. 
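
A short numerical sketch of this mechanism follows. It assumes binary variables throughout, and the response function `f` is a hypothetical stand-in for whatever function the model posits (the outcome A plays no role in the check and is omitted). With the first hidden bit uniform, the statistics of B at fixed T do not depend on S; with any other distribution on that bit they generally do.

```python
from itertools import product

def p_b_given_st(f, q1, q2, s, t):
    """P(B=1 | S=s, T=t) when B = f(s XOR l1, t, l2), with P(l1=1)=q1 and P(l2=1)=q2."""
    total = 0.0
    for l1, l2 in product((0, 1), repeat=2):
        weight = (q1 if l1 else 1 - q1) * (q2 if l2 else 1 - q2)
        total += weight * f(s ^ l1, t, l2)
    return total

def signalling(f, q1, q2):
    """Largest change in P(B=1 | s, t) when s is flipped at fixed t."""
    return max(abs(p_b_given_st(f, q1, q2, 0, t) - p_b_given_st(f, q1, q2, 1, t))
               for t in (0, 1))

def f(x, t, l2):
    """Hypothetical response function: B = (x AND t) XOR l2."""
    return (x & t) ^ l2

uniform_gap = signalling(f, 0.5, 0.2)   # l1 uniform: the fine-tuned case
biased_gap = signalling(f, 0.7, 0.2)    # l1 biased: S now affects the statistics of B
print(uniform_gap, biased_gap)
```

The uniform distribution works because S XOR l1 is then itself uniform whatever the value of S, so B carries no information about S. This is a property of one special point in parameter space, not of the causal structure.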



Note that this is precisely the sort of causal structure 
that is assumed in the Toner and Bacon model [17], where 
Bell-inequality violations are simulated by classical 
communication (see footnote 8). This model also involves fine-tuning insofar 
as signalling is prohibited only for a special distribution 
over the shared random variables posited by the model. 

The deBroglie-Bohm interpretation is a prominent ex- 
ample of a model that seeks to provide a causal explana- 
tion of Bell correlations using superluminal causal influ- 
ences. Consider the deBroglie-Bohm interpretation of a 
relativistic theory such as the model of QED provided by 
Struyve and Westman [18] , or else of a nonrelativistic the- 
ory wherein the interaction Hamiltonians are such that 
there is a maximum speed at which signals can propagate. 
In both cases, it is presumed that there is a preferred rest 
frame that is hidden at the operational level. In a Bell 
experiment, if the measurement on the left wing occurs 
prior to the measurement on the right wing relative to the 
preferred rest frame, then there is a superluminal causal 
influence from the setting on the left wing to the out- 
come on the right wing, mediated by the quantum state, 
which is considered to be a part of the ontology of the 
theory [19]. (Note that no causal influence from the outcome 
of the first experiment to the outcome of the second 
is required because the outcomes are deterministic functions 
of the Bohmian configuration and the wavefunction.) 
It follows from our analysis that the parameters in 
the causal model posited by the deBroglie-Bohm interpretation 
must be fine-tuned in order to explain the lack 
of superluminal signalling. 

8 This model works even when the measurement setting for each 
qubit is chosen arbitrarily, rather than being limited to the two 
settings of the CHSH experiment. 

Valentini's version of the deBroglie-Bohm interpretation 
makes this fact particularly clear. In Refs. [20, 21] 
he has noted that the wavefunction plays a dual role in 
the deBroglie-Bohm interpretation. On the one hand, 
it is part of the ontology, a pilot wave that dictates the 
dynamics of the system's configuration (the positions of 
the particles in the nonrelativistic theory). On the other 
hand, the wavefunction has a statistical character, spec- 
ifying the distribution over the system's configurations. 
In order to eliminate this dual role, Valentini suggests 
that the wavefunction is only a pilot wave and that any 
distribution over the configurations should be allowed as 
the initial condition. It is argued that one can still recover 
the standard distribution of configurations on a coarse-grained 
scale as a result of dynamical evolution [22]. 
Within this approach, the no-signalling constraint is a 
feature of a special equilibrium distribution. The ten- 
sion between Bell inequality violations and no-signalling 
is resolved by abandoning the latter as a fundamental 
feature of the world and asserting that it only holds as 
a contingent feature. The fine-tuning is explained as the 
consequence of equilibration. (It has also been noted in 
the causal model literature that equilibration phenomena 
might account for fine-tuning of causal parameters [23].) 
Conversely, the version of the deBroglie-Bohm interpretation 
espoused by Dürr, Goldstein and Zanghì [24], 
which takes no-signalling to be a non-contingent feature 
of the theory, does not seek to provide a dynamical explanation 
of the fine-tuning. Consequently, it seems fair 
to say that the fine-tuning required by the deBroglie- 
Bohm interpretation is less objectionable in Valentini's 
version of the theory. 



2. Superdeterminism 

Another option for a causal explanation of quantum 
correlations is to posit that the settings are not free but 
are causally influenced by other variables. 

For instance, the hidden variable $\lambda$ (which correlates 
the outcomes) might causally influence one or both of the 
setting variables, as illustrated in Figs. 26a and 26b. 
Alternatively, one can posit the existence of a second hidden 
variable $\mu$ that is a common cause for the setting on one 
wing and the outcome on the other wing, as illustrated in 
Fig. 26c. More complicated possibilities would have $\mu$ as 
a common cause of a subset of three of the settings and 
outcomes. Note that the possibility of a latent variable 
that is a common cause of $\lambda$ and one or both settings has 
not been excluded; it is incorporated into the first case. 
This is because any such variable could just be absorbed 
into the definition of $\lambda$ without loss of generality. The 
scenario in Fig. 26c could also be considered a special 
case of the one in Fig. 26a if we include $\mu$ into the 
definition of $\lambda$. Nonetheless, it is useful to separate out this 
second case because it posits that the common cause of 
A and B is not correlated with the common cause of S 
and B. 



FIG. 26: Some causal structures that exploit the 
superdeterminism loophole to explain Bell correlations. 

All of the causal influences posited in such models can 
be taken to be subluminal. However, such explanations 
of the Bell correlations are clearly in conflict with the 
notion that the settings can be freely chosen by the ex- 
perimenter. To assert one of these causal structures as 
a way to resolve the mystery of Bell's theorem is an in- 
stance of what is commonly known as the "superdeter- 
minism" loophole. But, just as with positing superlu- 
minal causal influences, these causal structures are not 
faithful to the observed correlations because one or more 
of the observed CI relations, namely $(S \perp T)$ (independence 
of settings), $(A \perp T \mid S)$ (no-signalling from left to right) 
and $(B \perp S \mid T)$ (no-signalling from right to left), can 
only be satisfied by fine-tuning of the parameters in the 
causal model. This is a novel sort of objection against 
the notion of a superdeterministic explanation of 
Bell-inequality violations, independent of an appeal to free 
will. 

It is worth devoting a few words to the sort of fine- 
tuning that is required. First note that in the context of 
abandoning the assumption of free will, the no-signalling 
constraint must be reinterpreted as an observed statis- 
tical independence, rather than a statement about the 
consequences of an intervention on a setting variable. Of 
course, this statistical independence is still observed and 
therefore must still be reproduced by the causal model. 
In the causal structure of Fig. 26a, if we define $\lambda^*$ to be 
that part of $\lambda$ that is correlated nontrivially with S, then 
we require that $(\lambda^* \perp B)$ despite the arrow from $\lambda$ to B. 
We can still do justice to the Bell correlations by having 
$\lambda^*$ correlated with only the parity of A and B, while 
remaining uncorrelated with B. This is an instance of 
fine-tuning. 

Similar fine-tuning tricks can be used to ensure that 
$(B \perp S \mid T)$ in the causal structures of Figs. 26b and 26c. 



3. Retrocausation 

"Retrocausation" refers to the possibility of causal in- 
fluences that act in a direction contrary to the standard 
arrow of time. It has been proposed as a means of resolv- 
ing the mystery of Bell-inequality violations 25-2"§] by 
purportedly saving the relativistic structure of the the- 
ory: rather than having causal influences propagating 
outside the light cone, they propagate within the light 
cone although possibly within the backward light cone. 

It is useful to distinguish two approaches to retrocausal 
explanations of Bell correlations: those that add cycles 
to the causal structure and those that do not. Given 
that the former take us outside the framework of directed 
acyclic graphs, we will confine our attention to acyclic 
retrocausation. 

Price has described the idea of a retrocausal model of 
Bell inequality violations in Ref. [30]. It is not completely 
clear whether he has in mind a model that posits cycles 
or not. However, he does argue that one way to generate 
a retrocausal model is to start with a superdeterministic 
model and to simply reverse the causal arrows that 
lead into the settings. For the examples of superdeterminism 
we have considered, such reversals lead to acyclic 
retrocausal models. For instance, if one starts with the 
superdeterministic causal structure of Fig. 26a and 
reverses the $\lambda \to S$ arrow, one obtains the causal structure 
of Fig. 27a, where setting S is a cause of the hidden 
variable $\lambda$. If one assumes that S is chosen freely at a time 
to the future of when $\lambda$ is set, then this model is clearly 
retrocausal. 

Alternatively, consider taking the superdeterministic 
model of Fig. 26c and reversing the $\mu \to S$ arrow, to 
obtain the causal structure of Fig. 27c. If $\mu$ were presumed 
to be space-like separated from both S and B, it would 
simply mediate a superluminal causal influence from S 
to B. However, if one posits that $\mu$ is in the common 
future of S and B, then we can imagine that there is 
a causal influence from S to $\mu$ that is subluminal, and 
one from $\mu$ to B that is retrocausal. Alternatively, if one 
posits that $\mu$ is in the common past of S and B, then 
the causal influence from S to $\mu$ must be assumed to be 
retrocausal. 

Note that if one views spatio-temporal relations as su- 
pervening upon causal relations, rather than vice-versa, 
then there is no freedom to specify the spatio-temporal 
location of $\mu$ and the distinction drawn above is not 
meaningful. Even if one takes spatio-temporal notions 
to be primary, the fact that the location of $\mu$ seems to be 
mere window-dressing in the context of a causal explana- 
tion of Bell-inequality violations undermines the distinc- 
tion between retrocausation and superluminal causation. 

FIG. 27: Causal structures that exploit the 
retrocausation loophole to explain Bell correlations. 

Fine-tuning is just as necessary within the retrocausal 
explanations as it was in the ones that posited superluminal 
influences or superdeterminism. Without it, one 
would obtain a correlation between S and B, in 
contradiction with their observed statistical independence. 
Indeed, if these causal structures could be supplemented 
with arbitrary causal parameters, then one could use the 
causal chain of influence that extends from S to B to 
send a signal. 



V. CONCLUSIONS 

Our two main conclusions are as follows. First, 
causal discovery algorithms that appeal only to conditional 
independences among observed variables cannot 
distinguish between Bell-inequality-violating and 
Bell-inequality-satisfying correlations. Better algorithms 
which look to the strength of correlations are needed to 
do justice to Bell's theorem. 

Second, and more importantly, we have shown that 
any causal model which can reproduce Bell-inequality 
violations while respecting the observed independences 
(the marginal independence of the measurement settings 
and the no-signalling condition) will necessarily violate a 
principle that is at the core of all the best causal discovery 
algorithms, namely, that observed independences should 
not be explained by fine-tuning of the causal parameters 
in the model. This is true for all explanatory strate- 
gies that fit within the framework of directed acyclic 
graphs supplemented with conditional probabilities, in- 
cluding models that posit superluminal causes, models 
that exploit the superdeterminism loophole, and models 
that posit retrocausation while avoiding causal cycles. 

The topic of causal discovery is still relatively young. 
The best algorithms available today are not likely to be 
the final story. Indeed, our analysis suggests that the 
tools that have been developed in the literature on the 
foundations of quantum theory for assessing the possi- 
bility of local explanations of correlations may well be 
important for developing causal discovery algorithms. If 
one could deliver on this promise, then it would be an 
interesting example of the field of quantum foundations 
having applications in other fields, such as statistics and 
machine learning, and via these, in medicine, genetics, 
economics and other disciplines wherein causal discovery 
plays a prominent role. 

Conversely, it is our view that there is a great deal more 
insight to be gained about the foundations of quantum 
theory from the literature on causal models and causal 
discovery algorithms. We consider a few possible direc- 
tions of research along these lines. 

As mentioned previously, defining causality in a man- 
ner that does not make reference to temporal ordering 
provides a language by which one could hope to describe 
a fundamental theory wherein spatio-temporal notions 
are emergent and notions of causal structure are prim- 
itive. In such a theory, it would not be the case that 
a cause was defined to be prior in time to its effects, 
but rather the notion of the temporal order of two events 
would be defined in terms of whether one event was a po- 
tential cause of the other. Consequently, the framework 
for causal inference provides a natural arena in which 
to pursue the idea that space-time is emergent, a notion 
that is popular in attempts to unify general relativity 
with quantum theory [32 E2] • 

There are a number of results in the quantum founda- 
tions literature that have the following form: make some 
assumptions about the causal structure and derive in- 
equalities on the correlations that can be obtained from 
these classically. Svetlichny's inequalities are an example 
of this [33], wherein one considers a triple of measure- 
ments at space-like separation and one allows a mixture 
of causal structures wherein superluminal influences can 
propagate between any two of the wings of the experi- 
ment. The topic has been studied in Refs. [51H5B] , Fritz 
has also recently derived inequalities on classical correla- 
tions for some causal structures that do not correspond 
to the standard Bell scenario 10J. Such results are exam- 
ples of a general approach to correlations that has been 
developed in the causal model literature. For instance, 
in Pearl's book (Sec. 8.4), inequalities on correlations are 
derived from assumptions about the causal structure in 
a section considering noncompliance in drug trials. Pearl 
points out the similarity between these "instrumental" 
inequalities and the Bell inequalities, and adds: "The 
instrumental inequality can, in a sense, be viewed as a 
generalization of Bell's inequality for cases where direct 
causal connection is permitted to operate between the 
correlated observables, X and Y." It will be interesting to 
see how many results in the quantum foundations litera- 
ture can be considered to be instances of such generalized 
inequalities. 



Finally, by exploiting a quantum analogue of condi- 
tional probability proposed by Leifer [37] and developed 
by Leifer and Spekkens [38] and an associated quantum 
analogue of conditional independence (see Leifer and 
Poulin [40], for instance), one can hope to explore a 
generalization of the notion of causal model to a quantum 
causal model. A quantum causal model is naturally de- 
fined as a quantum causal structure, which is a directed 
acyclic graph wherein each node is a quantum system, 
and a set of quantum causal parameters, which consti- 
tute a set of conditional quantum states (the quantum 
analogue of conditional probability) for every node given 
its causal parents. Insofar as one can accommodate clas- 
sical variables as special cases of quantum systems (corre- 
sponding to commuting algebras), one can describe cor- 
relations among settings and outcomes within quantum 
causal models. 

Quantum causal models make similar assumptions 
about the possibilities for causal structure as do classical 
causal models (no cycles for instance), and they make 
similar assumptions about the consequences of causal 
structure for statistical independences, but they replace 
the formalism of classical probability theory with a non- 
commutative generalization thereof. If one can make the 
case that the formalism of quantum causal models is not 
just a mathematical artifice but can be given a sensi- 
ble interpretation as a form of causal explanation, then 
such models can provide a causal explanation of Bell- 
inequality violations without requiring fine-tuning. 

Note, however, that if the conditional probabilities 
that appear in classical causal models are interpreted as 
degrees of belief - and we take this to be the most sensible 
interpretation - then the transition from classical causal 
models to quantum causal models involves not only a 
modification to physics, but a modification to the rules 
of inference. In this view, the correct theory of inference 
is not a priori but empirical. Nonetheless, one cannot 
simply declare by fiat that some formulation of quantum 
theory is a theory of inference. One must justify this 
claim. At a minimum, one must determine how stan- 
dard concepts in a theory of inference generalize to the 
quantum domain. One could also reconsider the various 
proposals for axiomatic derivations of classical probabil- 
ity theory, for instance, that of Cox [41] or that of de 
Finetti [55], to see whether a reasonable modification of 
the axioms yields a quantum theory of inference 9 . Ide- 
ally, one would show that if quantum causal models imply 
a modification to both our physics and to our theory of 
inference, then these modifications are not independent. 
After all, the physics determines the precise manner in 
which an agent can gather information about the world 
and in turn act upon it, and so the physics should determine 
what is the most adaptive theory of inference for 
an agent. It is in this sense that the project of defining 
quantum causal models is not yet complete and only with 
such a completion in hand can one really say that a causal 
explanation of the Bell correlations without recourse to 
fine-tuning has been achieved. 

9 Fuchs and Schack have also suggested that parts of quantum theory 
can be derived by an appeal to Dutch-book coherence following de 
Finetti [43]. 



Appendix A: d-separation 

Conditional independence relations are captured in directed 
acyclic graphs by the notion of directional separation, 
or d-separation. First let us introduce the basic elements 
from which a DAG may be composed; these are 
colliders, forks, and chains, which for three variables 
A, B, C are illustrated in Fig. 28. 

FIG. 28: Basic structures found in DAGs. (a) Collider. (b) Fork. (c) Chain. 

Given a DAG G, a path between two vertices X and 
Y is any set of edges and vertices which connects X and 
Y, regardless of the direction of the edges. We say that 
a path between X and Y is blocked by a set of vertices 
Z if at least one of the following conditions holds: 

1. The path contains a chain (Fig. 28c) or a fork 
(Fig. 28b) such that the middle vertex C is in Z. 

2. The path contains a collider (Fig. 28a) such that C 
is not in Z and no descendant of C is in Z. 

We then have the following definition of d-separation: 

Definition 2 (d-separation) Given a DAG G with 
vertices V, two vertices $X, Y \in V$ are d-separated by 
a set of vertices $Z \subset V$, written $(X \perp Y \mid Z)$, if and only 
if Z blocks all paths between X and Y. 



d-separation is a relation among three sets of variables 
in a DAG. If one is interpreting DAGs as causal net- 
works (as in this article), then d-separation must repre- 
sent a causal relation among the three sets of variables. 
By contrast, conditional independence represents a sta- 
tistical relation among them. One might say that X is 
causally screened off from Y given Z whenever X is d- 
separated from Y given Z. Of course, the significance 
of this causal relation is found in the statistical distribu- 
tions that can be supported by the causal structure. A 
set of variables X is d-separated from the set Y given the 
set Z in a causal structure if and only if for all probability 
distributions over the causal structure, X is conditionally 
independent of Y given Z. 
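
The definition can be turned directly into a small path-enumeration check. The following is a minimal sketch, assuming a DAG stored as a dictionary from each parent to the set of its children; the function names and the two example graphs (a locally causal structure with a hidden common cause labelled "L", and the structure with arrows S->A, S->B, T->B, A->B) are illustrative choices. It enumerates every simple path, so it is suitable only for small graphs.

```python
def descendants(dag, node):
    """All vertices reachable from `node` along directed edges."""
    seen, stack = set(), [node]
    while stack:
        for child in dag.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def d_separated(dag, x, y, z):
    """True if the set z blocks every path between x and y in the DAG."""
    z = set(z)
    nodes = set(dag) | {c for children in dag.values() for c in children}
    neighbours = {n: set() for n in nodes}
    for parent, children in dag.items():
        for child in children:
            neighbours[parent].add(child)
            neighbours[child].add(parent)

    def blocked(path):
        for prev, mid, nxt in zip(path, path[1:], path[2:]):
            collider = mid in dag.get(prev, ()) and mid in dag.get(nxt, ())
            if collider:
                # Condition 2: collider outside z with no descendant in z.
                if mid not in z and not (descendants(dag, mid) & z):
                    return True
            elif mid in z:
                # Condition 1: chain or fork whose middle vertex is in z.
                return True
        return False

    def paths(current, path):
        if current == y:
            yield path
            return
        for n in neighbours[current]:
            if n not in path:
                yield from paths(n, path + [n])

    return all(blocked(p) for p in paths(x, [x]))

# Locally causal structure: settings S, T; outcomes A, B; hidden common cause L.
local = {"S": {"A"}, "T": {"B"}, "L": {"A", "B"}}
# Structure with direct arrows from S and A into B, and no hidden variable.
direct = {"S": {"A", "B"}, "T": {"B"}, "A": {"B"}}

print(d_separated(local, "S", "T", set()),
      d_separated(local, "A", "T", {"S"}),
      d_separated(local, "B", "S", {"T"}))
print(d_separated(direct, "B", "S", {"T"}))
```

In the first graph all three generating independences of the no-signalling correlations are d-separation relations, whereas in the second the direct edge from S to B means B is never d-separated from S, so that independence could only arise from the parameters.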



